
INDEX

S.No. Topic Page No.

1 Lecture- 01 01
2 Lecture- 02 Machine Learning 25
3 Lecture- 03 41
4 Lecture- 04 52
5 Lecture- 05 60
6 Lecture- 06 68
7 Lecture- 07 74
8 Lecture- 08 93
9 Lecture- 09 107
10 Lecture- 10 123
11 Lecture- 11 141
12 Lecture- 12 182
13 Lecture- 13 224
14 Lecture- 14 283
15 Lecture- 15 311
16 Lecture- 16 327
17 Lecture- 17 336
18 Lecture- 18 343
19 Lecture- 19 349
20 Lecture- 20 354
21 Lecture- 21 366
22 Lecture- 22 374
23 Lecture- 23 390
24 Lecture- 24 403
25 Lecture- 25 410
26 Lecture- 26 419
27 Lecture- 27 431
28 Lecture- 28 442
29 Lecture- 29 446
30 Lecture- 30 466
31 Lecture- 31 472
32 Lecture- 32 478
33 Lecture- 33 482
34 Lecture- 34 488
35 Lecture- 35 494
36 Lecture- 36 497
37 Lecture- 37 504
38 Lecture- 38 505
39 Lecture- 39 518
40 Lecture- 40 525
41 Lecture- 41 531
42 Lecture- 42 541
43 Lecture- 43 574
44 Lecture- 44 587
45 Lecture- 45 597
46 Lecture- 46 605
47 Lecture- 47 618
48 Lecture- 48 683
49 Lecture- 49 699
50 Lecture- 50 734
51 Lecture- 51 742
52 Lecture- 52 752
53 Lecture- 53 756
54 Lecture- 54 765
55 Lecture- 55 771
56 Lecture- 56 781
57 Lecture- 57 801
58 Lecture- 58 818
59 Lecture- 59 825
60 Lecture- 60 833
61 Lecture- 61 855
62 Lecture- 62 861
63 Lecture- 63 869
64 Lecture- 64 875
65 Lecture- 65 884
66 Lecture- 66 907
67 Lecture- 67 925
68 Lecture- 68 936
69 Lecture- 69 948
70 Lecture- 70 960
71 Lecture- 71 965
72 Lecture- 72 968
73 Lecture- 73 973
74 Lecture- 74 975
75 Lecture- 75 977
76 Lecture- 76 985
77 Lecture- 77 1000
78 Lecture- 78 1007
79 Lecture- 79 1018
80 Lecture- 80 1022
81 Lecture- 81 1036
82 Lecture- 82 1046
83 Lecture- 83 1062
84 Lecture- 84 1076
85 Lecture- 85 1090
86 Lecture- 86 1092
87 Lecture- 87 1095
88 Lecture- 88 1114
89 Lecture- 89 1118
90 Lecture- 90 1121
91 Lecture- 91 1127
92 Lecture- 92 1135
93 Lecture- 93 1153
94 Lecture- 94 1166
95 Lecture- 95 1182
96 Lecture- 96 1198
97 Lecture- 97 1222
98 Lecture- 98 1244
99 Lecture- 99 1249
100 Lecture- 100 1254
101 Lecture- 101 1267
102 Lecture- 102 1276
103 Lecture- 103 1290
104 Lecture- 104 1293
105 Lecture- 105 1298
106 Lecture- 106 1317
Machine Learning for Engineering and Science Applications.
Professor Dr. Balaji Srinivasan.
Department of Mechanical Engineering.
Indian Institute of Technology, Madras.
Introduction to the Course
History of Artificial Intelligence.

Welcome to Machine Learning for Engineering and Science Applications. This is the first video
of the course. In this video we will be looking at an introduction to the course and a brief
history of artificial intelligence through the ages. So let us look at a few things that we use in
real life today that are essentially products of machine learning.

(Refer Slide Time: 0:39)

We have seen Amazon's recommendation system; you would have seen several such things
using other software or other websites also. Basically, you buy a bunch of stuff on the website
and it recommends other things that you might like, maybe other books or other products.
This is Amazon's Echo; it is effectively run by a speech recognition engine combined with
web searches. We have all been using Google's spam filter, or some other company's spam
filter, but Google's works really well; as a part of our mail system it works very seamlessly
nowadays.

This is Google's self-driving car, a Lexus. So now, what is common between all these 4?
Essentially, all of them use machine learning algorithms as an essential part of their overall
algorithm. Okay. So machine learning, very simplistically speaking, is a method or a set of
algorithms that you can use to replicate activities that typically require human cognition. Take
speech: humans recognise speech very well, and starting from when we are babies, we learn
to do it very quickly. Those of us that drive, drive fairly seamlessly. For the spam filter, most
of us can look at an email and almost instantaneously say whether it is spam or not.

However, in practice, encoding this into an algorithm is actually a difficult task.


This is because the number of rules expands very rapidly; you cannot say that a mail is
definitely going to be spam if it comes from Nigeria, or if it involves money, etc. So there is
no small, finite set of rules that you can write down, but nonetheless you would like a quick
spam filter which works as well as a human being does. In such circumstances we tend to use
a large basket of techniques; these are called machine learning models. Several of these are
very old models, at least half a century or sometimes even a century old.

So we will be looking at many of these algorithms. But as far as this course is concerned:
machine learning can be part of electrical engineering, it can be part of computer science, it is
a branch of mathematics; we are going to treat it as if it is going to be primarily used for
engineering and science applications. Just to give you a couple of applications: when you
have an x-ray or an MRI, a radiologist looks at it and starts giving you a diagnosis, okay, like
there is a tumour here, this is cancerous or not cancerous, etc.

Can we replicate this sort of process using a machine learning algorithm? Just like we are
able to replicate driving, can we replicate this kind of judgement? Doctor Ganapathy, for
example, is an expert in this field. Another application could be, and we will see this in the
middle of this course, flow past a cylinder. If you have a circular body, a cylinder in this case,
kept within an external flow, what you see here are the velocity contours that develop.

Traditionally, even today, we use several software packages for this; the whole process is
called computational fluid dynamics, often abbreviated as CFD. You might similarly use solid
mechanics modelling or any continuum-based, PDE-based modelling. The question is, can we
find models that can do this more rapidly using machine learning? We will see that indeed a
part of it is possible, and this is an exciting field. So our aim in this course is the following.

(Refer Slide Time: 4:31)

The first thing is that we want to understand basic machine learning models thoroughly.
Specifically, we are going to look at what is now very popular, Deep Learning; we will see
what it is later on today. We want to understand machine learning models thoroughly, and in
particular some very fundamental models that have been used for almost 50 years now, at
various stages of development. We will also look at some modern machine learning
techniques, which have caught on over the last decade or sometimes even in just the last few
years.

We will be looking at several things that have been done in this field even in the last year; the
field is moving very rapidly. We will locate all of this mostly in the context of engineering
applications. And finally, what we want to do is to apply these techniques hands-on to
problems in engineering. Now, within a video course of the sort that we are offering right
now, there is only so much that we can do. We will look at some basic coding paradigms and
we will also show you some examples, but the hope is that you are self-motivated and you
learn to program in Python by yourself.

We will give you the basic rudiments and several basic examples, for instance from medical
image diagnosis, from turbulence modelling, CFD computation, etc. But our expectation is
that you will also do some work on your own, okay; whether you take the course for credit or
not, you will get the maximum out of this course if you code yourself. We also hope that at
the end of this course you should be able to read and understand research papers in machine
learning, especially applied research papers.

You might not be able to understand very hard-core machine learning theory, but if somebody
has applied machine learning to some practical problem, hopefully you will be able to read
the paper and understand it. This is also a primary aim of the course, because most of the
development that is happening today is not necessarily present in textbooks; it is mostly
available as research papers, especially on arXiv. So our hope is that you should be able to get
this also. That is the broad course aim, okay.

(Refer Slide Time: 6:42)

So this course intersects with other courses on machine learning, but the emphasis is a little
bit more on the application side, and on getting an overview and a basic idea of the models
that are in play. Okay. So, in terms of prerequisites for the course, what is it that you require
in order to make sure that you can complete the course successfully? One of the primary
requirements is mathematical sophistication. Sophistication is a vague term; what it means is
that you are comfortable with mathematics, with thinking about things in a mathematical
framework.

Rather than thinking of things in a vague, merely qualitative framework, we would like you to
have a quantitative mindset. We will be introducing whatever essential ideas you need, not the
whole of linear algebra or probability or optimisation for that matter, but we will be
introducing you to the basic ideas that are required for the course, okay.

Each of those portions will be about one week. But more importantly, you must be
comfortable, okay, whenever we talk about something in mathematical terms, especially when
it comes to probability; you should be comfortable thinking through these things in a
mathematical framework, that is one. Similarly, you should be comfortable with
programming. Hopefully you have written programs in at least some language. We will be
using examples from Python and examples from MATLAB, so we expect you to be able to at
least understand syntax in Python and MATLAB.

But you will get the maximum out of the course if you are actually comfortable with
programming itself and you can do some hands-on exercises. We will be giving suggested
exercises throughout this course, so hopefully you should be able to do these, especially if
you are taking the course for credit.

(Refer Slide Time: 8:24)

So here is an outline of the course. The course has 3 broad parts. The first is artificial neural
networks and Deep Learning; this includes what are called CNNs, convolutional neural
networks, which are used for vision, and RNNs, which are typically used for sequential data.
So we will be using RNNs, CNNs, as well as what are typically called simple ANNs, and this
is the first part of the course. The 2nd part of the course is other classical techniques that have
been used for a long time.

They are still applied in various areas, depending on the complexity of the problem: tree-
based methods, support vector machines, probabilistic methods, etc.; this is the 2nd part of the
course. Finally, we look at some modern techniques such as generative adversarial networks,
etc., and reinforcement learning if time permits. As far as applications are concerned, within
each module we will discuss various applications as we go forth during the course.

(Refer Slide Time: 9:24)

So here is the syllabus for the course; this was announced on the website also. The first 3
weeks essentially cover the basics, okay, all the fundamentals that are required for the course.
In the first week we are going to look at linear algebra primarily. The 2nd week will be
probability and statistics; this could be a whole course in itself, so we cover just whatever
basics are required for this course. The 3rd week covers whatever numerical computation and
optimisation basics you require, and also an overview of the popular machine learning
packages that are available today.

Weeks 4 and 5 cover neural networks. It is possible to think of even linear regression as a
very simplified neural network. So we look at linear regression and logistic regression, which
are 2 basic, simplified algorithms, and then more complex, multilayer neural networks in the
next couple of weeks. The next 3 weeks are essentially variations on neural networks. This
involves convolutional neural networks, which are for vision; vision-based problems are
usually solved using convolutional neural networks.

Recurrent neural networks are typically used for sequence-based problems, okay, sequences
that develop in time; so time-series analysis, in some sense, can be done using recurrent
neural networks. Then we look at classical techniques, techniques that have been around for a
long time and are still used in conjunction with Deep Learning and neural networks. We will
also be covering some probabilistic techniques, for example Gaussian mixture models, etc.
Okay.

Unsupervised learning will also be covered here. Finally, we will look at some advanced
techniques; there might be some changes here as we go forth in the course, depending on how
students are doing. We can also add reinforcement learning, if time permits, towards the end.

(Refer Slide Time: 11:43)

Now, the reference books for this course. The first, even though it was published only in
2016, is by now already treated as a classic text. It is a very good text called Deep Learning,
from MIT Press, by Goodfellow, Bengio and Courville, all researchers in the field. The 2nd is
Pattern Recognition and Machine Learning by Christopher Bishop; this is also a very good
text. It is a dense text, a little bit harder to read, but it is an exceptionally well-written, very
thorough text.

The 3rd is towards the practical implementation side of deep learning; it is by Francois
Chollet and is called Deep Learning with Python. Now, fortunately, the first 2 texts are
actually available for free. Okay. These have been made available by the publishers
themselves; this is legal. You can search for these texts and you will find websites where
these books are shared by the publishers themselves. I would very highly recommend that
you go and take a look at them and read through these texts as the course progresses.

(Refer Slide Time: 13:11)

Please use the resources that have been generously made available by the publishers; we are
really grateful to them for having done that. So now let us look at the history of artificial
intelligence through the ages. The idea of artificial intelligence has been around for a really
long time; it is almost as old as when we started making tools. Why are we covering it? One,
to see that many of the ideas that we are covering are actually quite old; also to see the ebb
and flow of ideas, when ideas go up and when they come down. Some of the ideas we might
be covering in this course right now quite suddenly became unpopular for 5-10 years and
then suddenly became popular again.

(Refer Slide Time: 13:53)

So it is a good idea to know where a topic comes from. As I said, the history of artificial
intelligence is really old. Humankind has been fascinated with tools; we have been fascinated
with what we can do. By looking at our hands, we started asking what kind of machine tools
we can make to replicate the motion of the hand. Similarly, just as we have been thinking
about mechanical tools, we have also been thinking about thinking tools.

So, can we not only replicate the motions we make, the wheel for transportation, the hand for
working, the lever for lifting, etc., but actually make tools that can ease our thinking? This is
Raja Bhoj of the Parmar dynasty, from Bhopal. Now, this is speculation; I do not want to say
that he actually made such a machine, he did not. But such ideas appear within his works.
Raja Bhoj was a very accomplished person, a poet, an engineer with fantastic civil
engineering works, all sorts of things.

He speculated that you could have machines that imitated or replicated human speech and
motion. And even before him and after him there have been several people throughout the
world, in Greece, in Rome, etc., who have been doing this. There were realistic automatons in
several parts of the world, right from prehistory till date. One such example, though it is a
fake example, is what is called the Mechanical Turk. This was a claim, by essentially a con
man, that he could make a machine that could play chess.

What he actually had was a person who could play chess hidden inside. But nonetheless we
know that there were several genuine automatons, that is, things that could move
automatically and replicate at least human motion. Now Leibniz, who is also the father of
calculus along with Newton, had an idea of a calculus of human ideas. We will come to this
as we go on later. Leibniz's idea was that every thought we have is a combination of a few
axiomatic, simple, basic units of thought.

We do not go that way, but you can see an analogy with how modern machines work. How do
modern machines work? Modern machines, or modern computers, work on the basis that
everything can be broken down into zeros and ones. So the basic question that machine
learning people have been asking, in a very broad sense, is whether all that we do in terms of
thinking, in terms of creativity, can be broken down into a few elemental processes.

So Leibniz's speculation was that this is indeed possible, okay. Charles Babbage made, or at
least conceptualised, the first analytical engine; modern-day computers are very similar to
what Babbage conceptualised. This was right back in 1837; the fructification of this came in
the 1940s, when the first computers were made. My point is that whatever we think of as
computation today, it is actually amazing that, through the great work of those early decades,
they could have conceptualised what computers are doing today.

In some ways you can even think of computers today as artificial intelligence, in that you can
book a ticket, you can record speech, you can play music, you can see movies; this is a wide
variety of tasks, mind you, all done on one single humble computer. In some sense this is
already artificial intelligence. What we are going to do, at least in this course, is go a little bit
further. We know that while all this is going on, the computer is not really thinking, it is not
really learning. Our idea is to see if we can make algorithms which can actually learn.

(Refer Slide Time: 17:50)

So here is the birth of artificial intelligence. 1914 saw the first chess playing machine; all it
was doing was the king-and-rook versus king ending. Those of you who know how to play
chess will know that if one side has a king and rook and the other side has just a king, you
can always checkmate. So here was a machine which would actually do that, okay.
Surprisingly enough, the first driverless car came right back in 1925; this was made with the
help of the U.S. Army and, I think, Francis Houdina. This is not Houdini of magician fame;
this was a different person altogether.

So he made the first driverless car; however, this was radio controlled. Still, even this was
quite amazing to people back then. The first theoretical progress happened in the 1940s: we
had the first artificial neurons, which we will see when we come to neural networks. Turing
also first proposed the theory of computation and the idea of a universal computer. The idea
of a universal computer is the idea that one single computer can do all computable tasks.

This seems obvious to us, since we have sort of grown used to it, but it was not obvious in the
beginning that every single computation can actually be done this way. You can think of
ticket booking as a computation, you can think of playing a video as also a computation; that
all of this can be done on a single universal computer was not obvious at all. Alan Turing was
the person who pioneered this idea. Shannon also came up with information theory, which is
now used extensively within machine learning and of course in a lot of places like signal
processing, etc.

The 1950s were in some sense when artificial intelligence took off, whether you read the
science fiction of that period or the writings of ordinary researchers. Norbert Wiener came up
with the idea of cybernetics, which was very popular. Minsky, a very famous researcher in the
field, made the first neural net machine, the Stochastic Neural Analog Reinforcement
Calculator, or SNARC. And Simon and Newell: Simon was a Nobel Prize winner who
worked all his life on decision theory and, effectively, what we call artificial intelligence
today.

They made automatic theorem proving machines. And the first neural network machine,
made by Minsky in the 1950s, was not software but actually hardware. 1956 saw the first
coining of the term artificial intelligence, at a famous conference called the Dartmouth
conference; Simon, Newell and Shannon all participated in this. And the sentiment was really
positive. Rosenblatt was the first person who came up with a 2 layer artificial neural network
called the Perceptron; he unfortunately died really young, in 1971, at the age of 43.

Now another researcher, Arthur Samuel, defined machine learning a little more precisely than
we did in the beginning: machine learning is the field that gives computers the ability to learn
without being explicitly programmed. This is what is key, the ability to learn; we will define
learning itself a little bit later. So we will come to this distinction shortly, okay.

(Refer Slide Time: 21:32)

So here is the idea of being explicitly programmed. This is the idea of an expert system. Let
us say you want to make an algorithm that detects grammar errors, okay. One way to do it is
to start putting in all the rules of English grammar, let us say, into the machine: if this follows
that; if the subject is singular, the verb takes an 's'; and so on. But what happens shortly is
that in many cases it is really hard to program in all the rules. And we will see some examples
as we go forth.

You will see that it is quite hard to do even for grammar. It is not at all clear how human
beings are able to recognise different grammars for different languages. Most of us Indians
speak at least 2 to 3 languages, and most of us can seamlessly switch from the grammar of
one language to the grammar of another, and it is not clear what sort of rules we follow. So,
expert systems work well when the rules are clear; when the rules are not clear is typically
when we would like to use machine learning, okay.

So in the 1960s, when people thought of artificial intelligence, they were mostly thinking of
rule-based systems. For simple games like tic-tac-toe you can usually quickly give the rules
that will let you win, or at least not lose. But for more complex games such as chess or Go, it
is actually hard, and that is what people found out. So in the early days of artificial
intelligence, whether it was playing chess, making organic chemistry models, solving word
problems in algebra, etc., or even understanding natural language, some progress was made,
but it was not good enough, precisely because of this.

Because these systems were completely rule-based, you had to have a rule for every single
case, and if you did not give a rule, the computer did not know what to do. There was also
theoretical progress: back propagation, the algorithm which, as we will see, makes machine
learning algorithms, or at least neural networks, learn, was available way back in 1969, that
is, nearly 50 years ago as of 2018. The progress was good and people were very optimistic.
Here is Herbert Simon again; he was a Nobel Prize winner, as I said. He thought that
machines would be capable, by the 1980s, of doing any work a man can do.

Unfortunately, till today machines cannot even do what, let us say, mosquitoes or rats can do.
So we are nowhere close. Nonetheless we have made a lot of progress, which is why this
course is here. We also had Minsky saying something extremely positive and very similar.
But whenever you have this kind of hype cycle, you should always know that you are going
to get into problems. So what is known as the first AI winter happened between 1974 and
1980.

(Refer Slide Time: 24:28)

The problems were that all the results that existed, even within chess, were primarily for
simple toy problems, the kind of exercise problems we do in any course. The computational
power was also exceedingly low; today's cellphones probably have greater power than most
of the big mainframe machines had back then. So the computational problems people were
looking at in those days were really small. Then there was combinatorial explosion,
especially for rule-based systems, okay. Like I said, there is no simple finite set of rules with
which you can cover every single case of grammar, okay.

It is really difficult. Even for chess you cannot give a rule for every single situation; it is just
too expensive. Combinatorial explosion means that when you go from a small problem to a
larger problem, the number of choices expands enormously. If I only have a king-and-rook
versus king ending, I have only 3 pieces, and that is somewhat less complex. If I have 5
pieces, things start growing in terms of factorials, in terms of power laws, and that is very
hard for a computer to handle. Okay.

So the key to even modern-day machine learning is this idea: that common sense is nearly
impossible to program. A baby today can look at a face and recognise that it is its father or its
mother. But to actually program in why it is its father, why it is the same person regardless of
this person changing their clothes, changing their expression, changing the way they speak,
growing older, having a beard or not having a beard: it is really impossible to program every
single case in. Yet somehow, magically, human beings tend to do this really rapidly, okay.

How this happens is, of course, a long-standing problem in cognition; it is still an open
problem. Nonetheless, we do know that it is nearly impossible to program this in explicitly, at
least with rule-based programs; you cannot do it that easily. In fact, I will be bold enough to
say that you cannot do it at all, which is where machine learning steps in. One other thing that
fed into the 1974 to 1980 winter was Minsky's book on Perceptrons.

He made a very simple argument, which we will also make later; this was not an unknown
argument. The point was that very simple neural networks, what are called single layer neural
networks, cannot solve some simple, known problems. It is a fairly obvious argument, as you
will see later on in this course. He also made the argument that multilayer neural networks
are hard to train. What we mean by training is the automatic programming, so to speak, that
we are going to go into in this course.

The claim was that this is very difficult to do if you have more than one layer. If you do not
yet understand what a layer is, that is okay; we will see this later on during the course.
Anyway, this set off a panic, in conjunction with the fact that the results were already hyped.
What happened was a large loss of government funding in AI, and obviously most of the
research work stopped.

Nonetheless, some brave pioneers continued, and as usual, you have this kind of boom and
bust cycle that keeps going on in many fields; machine learning is just one example. One
boom was between 1980 and 1987; we actually had the first driverless car then. This was, if I
remember right, by Mercedes. And there was huge funding; if I understand it correctly, about
750 million pounds were invested in driverless cars back then. It did not come to much at the
time; now, of course, Tesla is taking over, Google's Lexus, etc.

(Refer Slide Time: 28:30)

There was a boom, primarily because some builders were using expert systems, and because
of the popularisation of back propagation. And, as usual, Minsky said that we were probably
going to go into a bust, and it did happen. Between 1987 and 1993, the PC became very
popular; people were not looking at large computations, but individual people were able to
do small word processing, etc., for their needs. They were not looking at grand aims like a
general machine becoming smart.

Again there were total funding cuts, which usually precede a long winter in AI. Now between
1994 and 2000, people got a little bit smarter and there was a long period of consolidation.
Some of you might recall, or might know, that in 1997 IBM's Deep Blue beat Kasparov in
chess. This was still a rule-based system, not a machine learning system. A machine learning
system playing chess came only last year, again from Google's people; it is an extension of
AlphaGo, called AlphaZero.

So in 1997 Deep Blue beat Kasparov in chess, and almost all chess engines that exist today,
Stockfish etc., are still rule-based. There was also simultaneous development in theory,
including probability theory, information theory, and optimisation theory, with good
optimisation algorithms which we will be using in this course. And of course there was the
stupendous power of Moore's Law. Moore's Law is the observation that the number of
transistors on a chip doubles every 2 years.

(Refer Slide Time: 30:12)

17
Now, what is shown here is an adapted version of that. On the y-axis is the number of
calculations that you can do per second per unit money that you spend. The y-axis is on a log
scale and the x-axis is linear, and you can see that there is exponential growth of
computational power, at least computational power in terms of the cost it takes to do a
calculation. Integrated circuits are there, and people are now predicting quantum
computation, which, at least for some types of algorithms, is supposed to help carry this
growth even further.

(Refer Slide Time: 31:08)

Google and other companies have invested very deeply in it. We also have GPUs, which
allow much cheaper computation. So Moore's Law, in terms of the exponential growth of
computational power, has really helped. The period from 2000 to 2012 I would call the quiet
years, but they were quiet only in terms of artificial intelligence hype; there were very
significant developments. The company Google was born; not only Google, there were
several search engines, and of course Google did it really well and came up with good
algorithms for that.

What helped was that there were a large number of searches, and when there is a large
number of searches, there is a large amount of data. The key thing for machine learning,
which is very data hungry as you will see, was the amount of data that could be subjected to
statistical analysis and statistical techniques. This is what happened between 2000 and 2012:
there was an Internet boom, a lot of people offering a lot of products, a lot of people
generating a lot of data, images, videos; all these came together and you had a large database
on which you could train.

By 'train' you will see what we mean later on in the course. Also we had Nvidia with GPUs,
which are very compact computational powerhouses. And we had specific good results, as a
few brave researchers continued their work using deep networks. Another thing that people
did, very pragmatically, was that instead of looking at bold aims like a machine becoming
intelligent on its own, they started looking at very specific outcomes.

Instead of saying something like 'I want a machine that will recognise everything it looks at',
which would be the full computer vision problem, they would say: can I have a machine that
can read a postcard and read the pin code? In that case it needs to recognise only 10 digits, 0
through 9. Such specific outcomes really helped in getting good results. That is what led to
the boom; even today you will see specific algorithms for vision, specific algorithms for
natural language processing, each in a specific area.

Instead of something that can read your mail and understand it, can I have a spam filter? That
is a specific outcome, and in such cases you can actually find out which kind of machine
learning model works better than the others. So this led to very good results and a positive
growth of the field. In 2005, once again, we had autonomous driving for about 135 miles
without any interruption. What people are trying for now is more sophisticated: can you
actually drive on a street while people are moving around? And we have had good results
with Tesla.

Also, one important result, I think this was in 2011 or 2012, I am not sure: IBM's Watson
beat the Jeopardy champion. This is a quiz show, and it is a nontrivial quiz show, not a simple
language show. It has puns, it has plays on words, so the machine needs to understand
language almost the way humans do, in some sense, okay. And after this period of
consolidation, we are now within what could be called an AI spring. We do not know when it
will end; some people are already raising doubts, but anyway, all of us know that we are in a
growth cycle right now.

(Refer Slide Time: 34:20)

A lot has pushed it. There is a lot of private funding, not just government funding; in fact
governments are catching up. Google has its DeepMind, IBM has its Watson, Facebook has
its own efforts, Microsoft has its own, and all of them have been fairly democratic about
sharing their resources also. There has been rapid growth in computational power, as I said
earlier, GPUs, etc. And a very important portion of it has been the rapid growth in data.

Facebook has data, Google has data, Microsoft has data, and they have been doing a lot of
data mining, legally, hopefully, and this has led to a lot of growth in machine learning itself.
A lot of people have done voluntary, distributed work in, let us say, tagging images; some of
us have done it even semi-voluntarily by solving captchas. Captchas are those things, you
know, where digits like N123, etc., pop up in order to identify whether you are a human
being or a robot. But what this has done is that it has also helped machines get trained.

Each time you solve one, a machine learns that this kind of image probably means N, and
this kind of image probably means 1. So that has been used in training too. So, voluntarily
and semi-voluntarily, we have been doing a lot of training for these machines, and that has
produced a lot of data. Games also have produced a lot of data.

(Refer Slide Time: 36:02)

The inflection point, the point which a lot of people identify as the real growth of machine
learning in AI, at least of the modern boom cycle, came sometime in 2012. There is a
challenge called the ImageNet challenge, which we will see when we come to CNNs. This is
a visual recognition challenge: out of a thousand categories of images you have to say which
one is which; is this a cat, is it a dog, is it a building, etc. Up to that point, all of the
algorithms which had been winning were, in some sense, traditional vision-based, rule-based
algorithms.

2012 was the first time that a machine learning algorithm, a convolutional neural network,
won; the algorithm which won is called AlexNet, which we will be covering in detail later on
in the course. This was 2012, and since then, every single year, the algorithm which has won
has been a machine learning algorithm. AlexNet showed a huge jump in performance, about
a 12 percent jump over previous algorithms. This is when people sat up and took notice, and
since then the field has just taken off very rapidly.

The number of people who have come in within the last 5 or 6 years is just huge. People who
started their Ph.D.s in 2012 without knowing any machine learning have done surprising
things within the course of their Ph.D., which is not a very long time. A lot of the material
that we will be covering in this course will actually be from the last 3, 4 or 5 years. We will
be covering classical techniques, but we will also be covering what has been done
specifically in the last few years.

This is another reason we are asking that you also learn how to read research papers: the field
is still developing, it is still in some sense early days, and you need to know how to keep up
with the literature. So part of the language and the techniques that we will be introducing
through the course is there to make sure that you can actually read the papers, understand
them, and maybe implement them yourself in some application of your interest.

(Refer Slide Time: 38:16)

So here are 2 results that have been the reason for people taking notice, and for some worry.
This is machine versus humans. The first is the same thing, Jeopardy, a quiz show in the
United States; I do not believe it exists in India as yet, but I am not quite sure. It is an
involved, language-based quiz show; of course it is knowledge-based also, but there are puns,
etc., and you have to understand allusions very clearly. The thing that won was IBM's
Watson, and it is a machine learning-based algorithm, or at least semi-rule-based, semi-
machine learning based.

Another thing that was frankly a shock for many people in the field was AlphaGo. Go is a
game which has simpler rules than chess, but it is known to be combinatorially much harder
to solve. It is a 19 by 19 board, where players simply place a white or a black piece. But
nonetheless, it was long known to be a hard problem in AI, whereas chess was thought to be
essentially solvable by a rule-based system. People thought that there would be no machine
which would beat a Go champion for maybe another 10 years.

Even Google, when they came up with AlphaGo, were not sure that they would actually win;
their aim while playing Lee Sedol, the Go champion at that point, was simply to maybe win a
game or 2, and then to learn and make the system better. But it actually beat Lee Sedol hands
down. And after that, I think they have retired AlphaGo; it has just been so good at beating
every single human being that it is practically unbeatable at this point. So one thing that is
also true about machine learning algorithms is that sometimes it is hard to know how good
they will be.

(Refer Slide Time: 40:31)

So these are 2 of the recent results, and they have been the cause of the recent hype cycle. So
the question is, is there anything different this time? And the answer is yes; at least there are
a few tangible things that are different this time. We generally have better technology, okay;
the computational power available today is exponentially larger, unlike the modest difference
between, let us say, the 1950s and the 1970s, between 2 earlier cycles. We have GPUs, and
then we have all sorts of futuristic computational technology that people are proposing.

Whether those come through or not, Moore's Law in its original form is kind of running its
course already, but we have different architectures. We have really big data; we are
practically drowning in data, and we probably need better algorithms to handle the kind of
data that we have. An important portion of the current boom cycle is that we have had
democratisation of resources. A lot of algorithms can be run on a simple laptop that is
accessible to most people today.

A laptop costing between 75,000 rupees and 1 lakh can have a simple GPU, a simple card,
that can actually do a very good job, and even some of the simpler algorithms can work on it.
Also, many of the commercial companies have been very generous with their software. All
the packages, whether it is Google's TensorFlow, Facebook's PyTorch, etc., or those from
IBM and Microsoft: all of them have made a lot of resources available to the public. The
open source movement has also taken off, and this has led to a lot of software being available
to the common public.

We will be using a few of these through this course, which actually lets a person come up to
speed very quickly. Even if they do not know how to code something from scratch, they can
use existing packages, or at least use a few of those algorithms. We will be seeing that later
on in this course. And there have been generally better algorithms; even though many are
variations of prior algorithms, we do genuinely have better algorithms today. So our focus for
the rest of the course is the algorithms portion. What we are really going to look at is which
algorithms work and under what circumstances they work.

We are going to look at algorithms as if they are models, okay. If you have done any
engineering problem at all, you will know that for sophisticated processes we have various
models. In fluid mechanics you have various models of how a fluid behaves, and various
models of how turbulence behaves. Similarly, in solid mechanics you would have seen
various models for how stress and strain should be modelled, etc. In every field: the ideal gas
law is a model, Ohm's Law is a model for how current, voltage and resistance play together.

So all these are models, and we think of algorithms as if they are models. Modelling what?
The specific input and output relationship in whatever problem you are looking at. Suppose I
have a mail and I am going to classify it as spam or not spam; there is something going on in
my brain which is modelling this. What sort of model will work best is what we are going to
look at through this course: under what circumstances, if you have a vision problem, what
kind of range of models do we have?

If you have a time sequence problem, what kind of models do we have? This is what we are
going to look at for the rest of this course. Thank you.

Machine Learning for Engineering and Science Applications.
Professor Dr. Balaji Srinivasan.
Department of Mechanical Engineering.
Indian Institute of Technology, Madras.
Overview of Machine Learning.

We will be looking at an overview of machine learning algorithms. In the last video we saw a
brief overview of the history of machine learning. Today we will be looking at a broad set of
ideas that play themselves out again and again in machine learning.

(Refer Slide Time: 0:33)

So here are some common terms that you will encounter if you are new to machine learning.
One is the term artificial intelligence. Artificial intelligence is a very broad term; it simply
means anything that tries to replicate the results of some aspect of human cognition. The
reason the word 'results' is emphasised is that we might not actually replicate the processes
themselves but only the results. So if somebody is playing chess or driving a car, all you want
is to make sure that the final output is the same, whether it comes from a machine or from a
human being.

As against this, machine learning is a specific term; it means programs that actually perform
better as their experience grows. What is meant by experience is something that we will
discuss a little bit later. What it means is this: if you have, let us say, a calculator, the
calculator is not getting better as you ask it to do multiplications again and again, but if a
human being is there, the person might actually get more accurate or faster as they do
multiplications for a while.

So machine learning is supposed to replicate this process: as experience in a field grows,
whether it is spam detection or vision or anything of that sort, machine learning is the set of
algorithms which actually gets better. Artificial intelligence might or might not get better
with experience. You would also have heard the terms neural networks or artificial neural
networks; these are a type of machine learning algorithm.

And most commonly, you would have heard the term Deep Learning, which refers to a
certain type of artificial neural network. Nowadays it is being used in a broader sense, but
more technically, all it means is a neural network with a large number of layers, which we
will see later. Finally, you would have heard the term Big Data. This is not a term that we will
be using in this course; simply put, it is a set of statistical techniques, some of which we also
use within machine learning. The basic idea in big data, which is often used very
commercially, is to find patterns that are not obvious.

Machine learning typically tries to find patterns which are obvious to human beings but
might not be obvious to programs, okay, whereas big data typically tries to find patterns
which are not obvious even to human beings. As far as this course is concerned, we will be
looking primarily at machine learning, neural networks and deep learning; we are not looking
at big data techniques or more general artificial intelligence techniques.

(Refer Slide Time: 3:18)

So here is a kind of Venn diagram to show the relationship between the various terms; this
has been adapted from Goodfellow's book. Artificial intelligence, as you can see, is a broad
term that encompasses a lot of things; it also encompasses rule-based learning, which I
discussed in the last video. Machine learning is explicitly not rule-based, as we will see a
little bit later. And deep learning is a particular subset of machine learning itself.

(Refer Slide Time: 4:02)

So what is machine learning? If you are completely unfamiliar with the field, you might think
it looks something like this; this is obviously not true. Okay, it is not a machine which is
reading books or taking in information and trying to learn something by itself. A very simple
definition, which is from Guo, is that it is simply using data to answer questions. More
specifically, an actual machine learning algorithm looks more like this. What is shown here is
an algorithm called support vector machines, which we will be seeing later on, somewhere
towards the middle of this course.

So support vector machines and other machine learning algorithms work as follows. Machine
learning is simply the study of computer algorithms that improve automatically through
experience, where the term experience simply means lots of data, okay. A formal definition,
which is there in the textbook by Tom Mitchell (the textbook is called Machine Learning), is
this: suppose you have a task T, okay, you have some experience E on it, and you have a
performance measure P.

A standard example: the task T could be, let us say, recognising spam. Suppose you have
emails and you want to recognise whether each email is spam or not. The experience E is the
data that you give: you give mails and label them spam or not spam; this is the experience
you are giving the program. P is the performance measure; here, the performance measure
could be what fraction of emails you are labelling correctly as spam or not spam. Okay.
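As a minimal sketch of this T/E/P framing in Python (the toy emails and the crude
keyword rule here are invented purely for illustration, not an actual spam filter):

    # Task T: label an email as spam or not spam.
    # Experience E: a list of (email_text, is_spam) examples.
    # Performance P: fraction of emails labelled correctly.

    def train(experience):
        # "Learning": collect words that appear only in spam examples.
        spam_words = set()
        for text, is_spam in experience:
            if is_spam:
                spam_words.update(text.lower().split())
        for text, is_spam in experience:
            if not is_spam:
                spam_words -= set(text.lower().split())
        return spam_words

    def predict(spam_words, text):
        return any(word in spam_words for word in text.lower().split())

    def performance(spam_words, test_set):
        correct = sum(predict(spam_words, text) == is_spam
                      for text, is_spam in test_set)
        return correct / len(test_set)

    E = [("win money now", True), ("meeting at noon", False), ("free money", True)]
    model = train(E)
    print(performance(model, [("money for you", True), ("lunch today", False)]))  # 1.0

As E grows (more labelled emails), a real learning algorithm's P should improve; that is
exactly Mitchell's criterion.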

(Refer Slide Time: 7:02)

So what this definition says is that as E increases, the performance P should get better and
better; any algorithm that achieves this is called a machine learning algorithm. So, here are
the machine learning paradigms, okay. This idea is adapted from Pedro Domingos, who has a
very good book, and actually from multiple sources; he has a course online also, and a
popular book called The Master Algorithm, which I would recommend that you read.
Francois Chollet's Deep Learning with Python book also has this idea.

So this is the classical programming picture: you have certain rules and certain data, they are
processed by the program, and it gives answers. For example, in a classical programming
approach to spam detection, you would have certain rules; say, if there are too many capital
letters, or if the email talks about money and puts a dollar sign in the middle, something of
that sort, those would be the rules. The data would be the emails that you give it, and once
the rules and the emails are given, it will give you some answers: spam or not spam, okay.

The important thing here is that these rules are fixed; that is the classical approach, as against
a machine learning approach. The machine learning approach is as follows: you give the
data, which is still the same set of emails, and you also give the answers, that is, whether each
one is spam or not spam, and it figures out the rules for itself. Okay. What is the rule that
maps this data to these answers? So this is the basic idea of machine learning: you have to
find a mapping between your input and your output. In this case the input is the emails and
the output is the answers, whether each is spam or not spam.

In other cases you could have data like images: is this a cat, is this a dog, is this a horse;
those are the answers. You show it thousands of images of cats, dogs and horses and you
label each one; this would be an example of what is called supervised learning. The
algorithm then finds out what rule it is that we are implicitly using in order to figure out what
a cat looks like, what a dog looks like, what a horse looks like, etc. You can use this kind of
paradigm for practically everything, as you will see throughout this course.

(Refer Slide Time: 9:20)

So when is this kind of machine learning useful? It is generally not a good idea to use
machine learning when you are actually very clear about the rules. This is generally true; we
will see some exceptions to it. A typical rule of thumb is: do not use machine learning if the
rules are concise and clear, okay. If there is no ambiguity about what the rules are, and you
are not a victim of combinatorial explosion, then machine learning is probably not the best
thing to go for.

However, there are cases where experts are not able to explain their expertise. For example,
you drive a car; how do you drive a car? It is not easy to concisely explain it as a finite set of
rules: this is how I am driving a car, this is how I recognise that something is spam or not
spam. It seems kind of obvious to us; when we see our friend, whether this friend has a cap
on or a different shirt on, we can immediately recognise that this is the same friend, or that
our parent is so-and-so; even a child recognises this fairly quickly.

In such cases, when we are not able to explain our expertise, it usually means the rules are
difficult to extract. The more obvious something is to us, the more difficult it usually is to
extract the rules, okay. And you usually have combinatorial explosion: the problem gets more
and more complex, and even for a slight increase in complexity, the number of rules you
would have to give becomes too many. In such cases, it is usually better to use the machine
learning paradigm, that is, to simply say: this is my input, this is my output, figure out the
rules for yourself.

In certain other cases, machine learning is useful even if you might know the rules; even
then, for the examples I have used here, navigation is a hard problem. For hazardous
environments, it is usually a good idea to use machine learning or some other artificial
intelligence algorithm. The same holds when you have solutions that need to be adapted to
very specific cases; for example, if you want patient-specific treatment accounting for a
patient's particular allergies, again the number of rules that you would have to give would be
too many.

(Refer Slide Time: 11:51)

So in such cases also, machine learning can be quite useful, okay. Now here is the
fundamental trick that is utilised in most of machine learning. Almost all of machine learning
uses this fundamental idea: every problem that you have, whether it is a face recognition
problem, a spam recognition problem, a fluid mechanics problem, whatever it is, every
problem can be posed as a data problem, okay. Data here means something involving
numbers. Okay.

And every solution that we offer can be thought of as a function or a map, okay. So here is
the problem. For example, let us say we are doing an image recognition problem; I will go
back to the same example. You have an image, and you want to recognise whether it is a cat
or a dog, okay. The difficulty is that we get sensory input as qualia, that is, basically we get
qualitative inputs; these are not numbers. When you see a cat, almost invariably all you see
are certain features of the cat, you see the eyes, ears, nose, etc.; you do not actually see
numbers.

However, if you want to turn it into a data problem, you will have to somehow change this
from an image into numbers. Okay. So these images, which we are getting as inputs for our
problem, these qualitative inputs, have to be turned into numbers, and after this
transformation, the result is called an input vector. This is what goes into the program, okay.
So when we had the box with data coming in, this was that data: the input vectors that you
are giving to the problem.

Similarly for the outputs: let us go back to the same example. If I see an image, I can call it a
cat or I can call it a dog, but 'cat' and 'dog' are words, not numbers. You again have to turn
these into numbers as well, and these are called output or target vectors, okay. These are the
answers from the previous slide; they also have to be posed as numbers. There is a slight
difference between an output and a target vector: the output vector is what the machine gives
out in the end, while target vectors are what we give as examples during training; we will see
this later. Okay.

So an essential part of the process of machine learning is to decide on appropriate inputs and
appropriate outputs, ones that can easily be turned into numbers and with which you can train
your algorithm. This is an essential part of the process. Even the rules that we get out have to
be finally posed in terms of formulae, programs or numbers, okay. Now the learning task is to
find a map that takes the input and gives out the output. This can be thought of as a function
that takes in an input vector and gives out an output vector. Okay, so this is a function or a
map that does this.
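Written out with generic symbols (this notation is the usual convention, not taken from the
slides): if the input vector x has n components and the output vector y has m components,
the learning task is to find a function

    f : R^n -> R^m,    y = f(x),

such that, for the labelled examples we have, f applied to each input vector comes as close as
possible to the corresponding target vector.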

So this is the fundamental trick that we will always use. Any problem you have, even a
cognitive problem, can always be turned into a problem which takes in a bunch of numbers
and gives out a bunch of numbers, and what we want to find is the map that takes these input
numbers to the output numbers. This is the fundamental idea behind most of machine
learning. Now let us come to the various types of learning problems.

(Refer Slide Time: 16:12)

Now, before I go into this, I want to point out that even though learning approaches have
traditionally been split into several types, not all of them have clear boundaries. So you might
find a case that could go into one type of learning approach or another; let us see a few. One
of the most popular, and the one the examples I have used most commonly belong to, is what
is called supervised learning. Supervised learning uses data which is labelled by human
experts: somebody has labelled the data and said, for this input, this is the output.

An example is something of this sort. Let us say you have a lot of data points; each of these
data points could represent anything. Please remember from the last slide: each point here
could represent a whole image, because any image can be turned into a vector, a set of
numbers, okay. So let us say we have 3 types of data, as you can see: one set of crosses which
are blue, one set of squares which are black, and one set of circles which are red.

And suppose somebody gives a new point here or there, and you want to find out whether it
is of type cross, type square or type circle. Okay, this is a supervised learning problem. The
spam-not spam example I gave you was also a supervised learning problem, because for each
email you gave, you also simultaneously said whether it is spam or not spam. So if your
dataset is already labelled by a human expert and tells you example outputs, that is a
supervised learning problem. Okay.
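
As a small illustrative sketch of this supervised setting (in Python with NumPy; the data and the nearest-centroid rule are made-up assumptions, not anything prescribed in the lecture):

    import numpy as np

    # Labelled training data: 2D points, with labels given by a human expert
    X = np.array([[1.0, 1.1], [1.2, 0.9],    # class 0, the "crosses"
                  [5.0, 5.2], [4.8, 5.1],    # class 1, the "squares"
                  [9.0, 1.0], [9.2, 1.3]])   # class 2, the "circles"
    y = np.array([0, 0, 1, 1, 2, 2])

    # Classify a new point by the nearest class centroid
    centroids = np.array([X[y == c].mean(axis=0) for c in np.unique(y)])
    new_point = np.array([5.1, 4.9])
    predicted = np.argmin(np.linalg.norm(centroids - new_point, axis=1))
    print(predicted)   # 1, i.e. type "square"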

So some examples are labelling images, speech recognition, and optical character recognition,
which is to take material written by human beings and actually find out whether a given
character is S, P, etc. When you do handwriting recognition, or printed material recognition
out of images, that is a supervised learning problem. So a large class of problems can
actually be turned into supervised learning problems. Another important category is what is
known as unsupervised learning. In this case, the label for the data is not given.

So let us see the same data here, except the difference is that instead of marking cross, square
and circle, I have not made any distinction between the data. Nonetheless, as human beings
we can automatically recognise that there are some clusters here: this might be one type
of data, this might be another type of data, and this might be a third type of data. In such
cases, supervision or labels are not given; nonetheless we are supposed to automatically
recognise the natural clusters that are forming, okay.
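
A minimal sketch of this unsupervised setting, assuming scikit-learn's k-means (one clustering algorithm among many; the lecture does not prescribe a specific one):

    import numpy as np
    from sklearn.cluster import KMeans

    # The same points as before, but now no labels are given
    X = np.array([[1.0, 1.1], [1.2, 0.9],
                  [5.0, 5.2], [4.8, 5.1],
                  [9.0, 1.0], [9.2, 1.3]])

    # Ask for 3 clusters; the algorithm must discover them by itself
    kmeans = KMeans(n_clusters=3, n_init=10).fit(X)
    print(kmeans.labels_)   # cluster index assigned to each point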

Such cases can arise in multiple applications. For example, let us say customers above 40
are purchasing in a certain way and customers below 20 are purchasing in a certain way, but
you do not know a priori that these are customers above 40 and these are customers below
20, etc. But you see certain buying patterns; in such cases the data will naturally form
clusters and the machine is supposed to recognise this automatically. Even though it seems
obvious to us, the machine is supposed to recognise through some algorithm that this is one
cluster and that is another cluster.

Other such cases are detecting new diseases, or finding out something like credit card fraud:
a customer has been purchasing in a certain way for a long time and suddenly there is a
change in the pattern of purchase; that would be anomaly detection, and that is a type of
unsupervised learning problem. There are some types of learning approaches which often lie
at the interface of supervised and unsupervised learning.

(Refer Slide Time: 19:55)

One set of problems that we will be looking at are what are called generative approaches.
The idea behind a generative approach is to create new data that is somewhat like a given
set of data. For example, if I show you 100 images of cats, any human being can at least try
to draw a new cat which will not look exactly like the 100 images already seen; it will look
somewhat different, but it will at least capture the key features of a cat.

Such a learning approach is called a generative approach. This is neither labelling nor
clustering; what it is actually doing is generating new data. Typically this is included within
unsupervised learning. We will be covering generative approaches towards the end of this
course and also during some sequence learning. There is another type of learning called
semi-supervised learning; this situation is quite common, especially with medical images. You
have a small amount of labelled data available, along with unlabelled data.

So let us say you have MRI scans, and within some of them you have, let us say, labelled
tumours. But you also have a lot of other data that the expert has not been able to go over.
In such cases you leverage the labelled data together with the unlabelled data and then solve
a full supervised learning problem; this is called semi-supervised learning. There is also
something called self-supervised learning, where you actually do not have any labelled data
at all, but you can figure out some implicit labels, okay, from the data using heuristics.

An example of this is what are called autoencoders, which we will cover later on in the course.
Another example would be something like: you have a few video frames and you want to
predict the next video frame. In such a case, you would use self-supervised learning,
okay. Finally, we have something called reinforcement learning, which is getting a lot of
traction nowadays. The easiest example for reinforcement learning would be something again
like chess or any video game that you play.

So you make a move, and maybe 20-30 moves later you get to know only one thing: did you
win, did you lose or did you draw. But early on, 20-30 moves ago, you did not know whether
that particular move that you chose led you to win or to lose. Okay. So you are trying to
find out what action to take at a particular point based on rewards that are really far
removed in time. Okay. So unlike, let us say, simple supervised learning, where I show you an
image of a cat and say cat, here you are making a move, and you do not know whether the
move is right or wrong, whether it led you to win, lose or draw; you only know the result of a
combination of moves after a long time, okay.

So trying to learn under such an environment is called reinforcement learning; we will be
looking at a brief introduction to this also towards the end of the course. Now again I will
repeat the same point that I made earlier, which is that the distinction between these various
classes is actually quite blurred quite often. We will mostly approach the course as if we are
doing either supervised learning or unsupervised learning. So here, again according to Guo,
are some 7 steps in machine learning. This is not a hard and fast rule, but it is a very good
abstraction of the whole machine learning process.

(Refer Slide Time: 23:39)

The first step of course is to decide on what data you want to use for the problem and then
gather the data. Often you will have to do other things here: you will have to clean the data
up, etc., which is the second step. You also want to ensure that there is no inherent bias.
For example, people doing election polling want to make sure that they have not oversampled
one section of the population or another. Similarly, when you collect, let us say, data for
whether a person has cancer or not, you are likely to have something which is called class
imbalance.

That is because if I randomly collect data from the population, 99.5 percent of the people are
bound not to have cancer. So even if I take random data and say this person does not have
cancer, I am going to be right 99.5 percent of the time. Because the amount of data that I have
for people with cancer is actually very very low. So when you prepare data, you want to
make sure that either the data is without bias or that you have sufficiently accounted for this
in your algorithm.
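
A quick numerical illustration of why class imbalance is dangerous; the 0.5 percent figure and the data here are hypothetical:

    import numpy as np

    # Hypothetical labels: 0.5% of people have the disease (label 1)
    rng = np.random.default_rng(0)
    y_true = (rng.random(10000) < 0.005).astype(int)

    # A useless "classifier" that always predicts the majority class
    y_pred = np.zeros_like(y_true)
    print((y_pred == y_true).mean())   # about 0.995, yet it detects nobody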

The third part is choosing a model or algorithm. We will be covering a large number of
algorithms through this course, okay; some of those are written here: random forests,
artificial neural networks, etc. So choosing an algorithm is part of the problem; there is no
hard and fast simple rule for which algorithm works best for which problem. This is very
much like modelling in the engineering sciences: there is not always a clear model that you
can use. Some models perform well in some domains and some perform well in other
domains.

We will discuss details of several models and algorithms in this course so that you can
choose appropriately. Of course, choosing this is more of an art than a science, okay. And
then comes training: each model that you use will have certain unknown parameters, which we
will see in the rest of the course. Using data in order to determine model parameters is known
as training. And of course, after you do this, you then test how your particular model and
particular set of parameters did.

And if it did not do well, you might have to tune a few things, hyperparameters, which we
will come to later. And after this whole process, the training and testing process, is over,
prediction is the final deployment. So let us say out of all this you made an app which does
machine learning and it is a cat identifier. The final prediction step is: you deploy the app and
the customer uses it in order to check whether an image has a cat or not, or whether an email
is spam or not spam.
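
The whole process can be sketched in a few lines; this is a hedged illustration using scikit-learn with synthetic data, not the specific tools or steps mandated by the course:

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import RandomForestClassifier

    # Steps 1-2: gather and prepare (here: synthetic) data
    rng = np.random.default_rng(1)
    X = rng.random((200, 4))
    y = (X[:, 0] + X[:, 1] > 1.0).astype(int)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

    # Steps 3-4: choose a model and train it (determine its parameters)
    model = RandomForestClassifier().fit(X_train, y_train)

    # Steps 5-6: test; if the score is poor, go back and tune hyperparameters
    print(model.score(X_test, y_test))

    # Step 7: deployment means calling model.predict() on new, unseen inputs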

(Refer Slide Time: 26:35)

So the first set of algorithms we will be doing in the course will be supervised learning
algorithms. Typically, supervised learning splits into two: you either have a classification
problem or a regression problem. A classification problem simply means you want to split the
data into discrete categories. So this could be category 1, this could be category 2 and this
could be category 3. In such problems, the end result that you want is to know: what is this,
is it A, B or C, is this a cat, dog or a horse, is this email spam or not spam. Such a problem is
called a classification problem.

This happens whenever you have discrete data. For example cancer versus not cancer, benign
tumour versus malignant tumour, etc.; tumour classification would be a classification problem.
The other type of problem is regression. Regression typically deals with real-number data:
each example has a number associated with it, and you have examples of something happening
in the past. You could have house prices depending on their area, or, for the example that is
written on the slide, you could have previous stock prices and you want to know what the
stock price is going to be tomorrow.

Such problems are regression problems; strictly speaking, they are not classification
problems. The answer is not a discrete category like good or bad; you actually want an actual
number out of a given set of numbers. These problems are known as regression problems.

(Refer Slide Time: 28:17)

So, some of the mathematical ideas that we will be using in this course are from linear
algebra. Why do we need linear algebra? Remember that, as I said earlier, machine learning
involves mapping. It involves mapping from an input vector to an output vector. Now what
maps vectors to vectors? Matrices do, okay. So if I take one vector and I have to map it to
another vector of a different size, I have to use a matrix, okay. Which is why we are going to
look at linear algebra.

Again we will only cover very rudimentary ideas; most of the linear algebra should already be
familiar to you. Okay. Next is probability. The reason we use probability is that there is
uncertainty, whether in the data that is given to us or in the results that we see. You might
see a person from far away and might not be sure whether it is your friend or not. A person
identifying a criminal from a line-up might not be 100 percent sure that this is exactly the
criminal that they want. Similarly the machine need not be completely sure that this image is
that of a dog or a cat, or that this tumour is cancerous or not cancerous.

So they have some amount of uncertainty built into them. So we account for this uncertainty
using probability theory. A very important component of machine learning is the idea of
conditional probability. So in case you do not know it, please do refresh it, we will be looking
at it through this course also but this is just a heads up for you, that conditional probability is
particularly important. The next idea that we will be looking at is that of optimisation.

The reason we require optimisation is that we have a whole bunch of models within machine
learning and we want to find out which set of parameters is the best for a given model. Of
course, when we come to optimisation, automatically you come to differentiation and you
come to multivariable calculus. So we will be looking at simple calculus; even though we
will not be covering calculus as such, we will be looking at multivariable ideas such as
gradients, etc., which are very important for optimisation.

Optimisation is really important because finally most machine learning models actually
reduce to solving some optimisation problem or the other. In fact, modern machine learning
theory uses optimisation theory extensively.

Machine Learning for Engineering and Science Applications
Professor Dr. Balaji Srinivasan
Department of Mechanical Engineering
Indian Institute of Technology, Madras
Why Linear Algebra? Scalars, Vectors, Tensors

(Refer Slide Time: 0:15)

In this video, we will be beginning our mathematical excursion; we will start looking at the
rudiments of the linear algebra which is needed for this course. As I said earlier, linear
algebra is a vast vast vast subject; we are going to only look at very tiny pieces of linear
algebra. In this particular video, I am going to look at just two simple things: why is it
that we require linear algebra, and also the basics of what scalars, vectors and tensors mean.

(Refer Slide Time: 0:48)

Now, why is linear algebra useful in the context of machine learning? As I had said earlier,
in many machine learning algorithms, or in fact in most machine learning algorithms, the
input and output are both represented as vectors. By vectors we simply mean a collection of
numbers. Now, part of the problem in machine learning, as I said earlier, is to convert what
seems to be a qualitative input, for example a picture, a sound, you know colours, even
sometimes smells, something of that sort, into a number, because machines only understand
numbers and our algorithms work only on numbers; they do not work on qualitative inputs,
they work on quantitative inputs, okay.

So, let us see how we can do this. What I am going to show in the next slide is how we can
actually convert what looks like a qualitative input; for example, I am going to look at a
picture and let us see how we can convert it into a number.

(Refer Slide Time: 1:48)

So, how do you take an image and turn it into a vector? Here is an example. You can see on
your screens a whole bunch of 0s and 1s; these are handwritten digits. This database is
called the MNIST database; we will look at this in detail when we come to convolutional
neural networks. But what it is, is just a bunch of images; what is shown here is actually a
collection of images of 0s, 1s and up till 9.

So, the machine learning task that people usually deal with when it comes to the MNIST
database is to look at an image, let us say this one, and identify which, out of those 10
digits, it is, okay. So, the input here is an image and the output here is what we will call
a class, or actually which digit this is; so you have 10 possibilities, 0 through 9. So, now
the question is how do you represent the input as a vector, or as a series of numbers, and
how do you express the output as a series of numbers. The output is kind of obvious because
it goes from 0 through 9; you can at least think of a single number coming out as an output,
and that number is going to be 0, 1, 2, or up till 9. But what do we do with the input, okay?

So, the input usually looks like this, okay, something of this sort. Now, we know that this
input, this image that you are seeing on your screens, is actually dependent on the resolution
of your screen, and the way it is represented in the computer is through a series of pixels.
So the machine has a whole bunch of pixels; let us say, in this case, we have a 60 × 60 grid
of pixels. Each of these pixels has a single value associated with it, each pixel has a
number associated with it, and typically this is what the numbers look like.

So, what you are seeing on your screen is some numbers that vary be-
tween 0 to 255, each of these corresponds to a pixel this is typically you can
call it the pixel intensity, okay. So if you have a grayscale or let us say a
black and white or grayscale image, what it will give you is something like
0 for a completely black pixel and it will ramp up to 255 for a completely
white pixel, okay.

So, by giving a value between 0 and 255 and giving various values for
various pixels you can actually reconstruct an image, this is the way the
image is represented in a computer. So, this is a natural way of converting
your image, this is the image, or this is actually a series of images, but you
can represent it as a matrix, the matrix further if you wish, you can unroll
it into a vector, what do I mean by unroll? Suppose, I take let us say, this
was the first column, this is the second column, then I take this first column
and have, 1, 2, 3, 4, up till 10 values, then after that I take this put it at the
bottom, okay.

So, this way the whole of the matrix is unrolled into one single vector and
usually we do, do that in machine learning algorithms because, it is actually
easier to handle vectors, rather than matrices, but, it depends on which kind
of algorithm you are using. So, you have a non-uniqueness in representation,
you can take an image of this sort and either represent it, as a simple matrix
of numbers or as a vector of numbers.
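
As a small sketch of this unrolling in NumPy (the image here is random, just to show the shapes; order='F' stacks the columns one below the other, as described above):

    import numpy as np

    # A hypothetical 60x60 grayscale image: one intensity (0-255) per pixel
    image = np.random.randint(0, 256, size=(60, 60))

    # Unroll column by column into a single vector
    vector = image.reshape(-1, order='F')
    print(vector.shape)    # (3600,)

    # The unrolling is reversible: the matrix form can be recovered
    restored = vector.reshape((60, 60), order='F')
    print(np.array_equal(image, restored))   # True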

The more important point here is to see that what looks like a number to us, or what looks
like an image to us, can actually be converted into a full vector, okay. To give you another
example, you can see something like this: this is an image of India, and this is a colour
image, okay. As against the previous one, which was a black and white image, this one is a
colour image. Now, for a colour image you need not only a single set of intensities; it is
usually represented in RGB, that is red, green and blue.

So, you actually will have 3 sets of matrices, one of which will show you the red intensity
of this image, one of which will show you the green intensity, and one of which will show you
the blue intensity, and all these put together give us the impression of a single image with
varying colours, okay. We will look at this in greater detail when we come to convolutional
neural networks, but for now you can see it on your screen. You know, I have just a small
MATLAB script: I took this image, this India image, and I found out what the size of this
image was, you know, with this kind of representation, what is the size.

What it shows is that the size, or at least the 2D size, of this box is 600 × 538 pixels, but
then there are 3 such layers, so what we have is a 600 × 538 × 3 image; basically this is a
matrix of dimension 3. In the first dimension there are 600 entries, in the second dimension
538, and in the third dimension you have 3 entries, okay. All this put together, essentially
it is a stack, a stack of 3 images each of which is 600 × 538, okay, so you can have RGB.

Okay, so the take away point from this particular slide is that you can
take any image and you can turn it into a vector or a matrix of numbers, so
please do remember this. So we will be dealing with such vectors throughout
the course.

(Refer Slide Time: 8:40)

So, let us now look at some simple notation; if you are familiar with the notation for
matrices, scalars, etc., you can skip this slide very easily, okay. A scalar is, you know, a
single number; we typically use small Greek letters for scalars, okay. R here is the real
numbers, as you might know. So, an example is, let us say, α is the learning rate, or n is
the number of hyperparameters; learning rate and hyperparameters are things that will come
up later on in the algorithms that we use for machine learning.

A vector in machine learning is simply an array of numbers, okay. Now, typically in physics
or even in hard core mathematics, vectors have very specific meanings; we are not looking at
that. We are looking at any, even unconnected, series of numbers: for example, x1 could be
height, x2 could be weight, xn could be number of people. So, you can put together any
number of things, and all of those put together, as long as you concatenate them into a
column, we call a vector x. So, please do remember this is a bold letter; bold small letters
we will typically use for vectors, or sometimes we might even use something like an x with an
arrow over it.

So, we use this kind of notation interchangeably: either we use this, or we use the column
matrix representation for vectors. Vectors will be the quantities that we are dealing with
most often. Of course, you have the next level, the matrix; this again means that W is an
m × n matrix, and it is simply a 2-D array of numbers. In general we use the term tensor for
anything which is an array of numbers with number of dimensions greater than 2, okay. So, in
that case you will denote it by a subscript, A_ijk,
okay.
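
As a small NumPy sketch of this hierarchy of notation (the particular numbers are placeholders):

    import numpy as np

    alpha = 0.01                     # a scalar: a single number
    x = np.array([1.7, 65.0, 4.0])   # a vector: a 1-D array of numbers
    W = np.zeros((3, 4))             # a matrix: a 2-D array (m x n)
    A = np.zeros((2, 3, 4))          # a tensor: here 3 dimensions, A_ijk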

(Refer Slide Time: 10:58)

So, let us look at this in some more detail. Once again the same thing: whether it is a
scalar, vector, matrix or tensor, all of these are effectively covered by the general term
tensor, okay. And it is because of this use of tensors, or multi-dimensional matrices, that
Google calls its package for machine learning TensorFlow, which we will look at a little bit
later, in a couple of weeks.

So, now, a scalar is nothing but a 0th order tensor, and a vector is called a first order
tensor. Now, I would like you to be a little bit careful: what is the dimension of this
vector? Suppose you have a vector in 3 dimensional space; if it is a location, you would
typically denote it with three numbers. So the number of entries is what we call the
dimension, okay.

So, in this case, the dimension is 4, however the order of the tensor is 1,
what I mean by the order of a tensor is you simply have one single column,
if you had a column and a row then, it will be a second order tensor as we
will see shortly, but notice, that the number of dimensions of this vector is
equal to the number of entries, or number of components, that it has, okay,
so this has 4 components, you could denote it by, [1, 2, 3, 4], for example.

Now, if we go back and think about our image example, remember this
example that we just had let us say, this is a 60×60 pixel image, and if I turn
this into a vector, the way I would do it, is first I will turn it into a matrix,
the matrix will be a 60 × 60 matrix, at this point it has both a row and a
column, then I could unroll it, how would I unroll it? As I said you have
first column, you have a series of numbers here, you have second column, you
have another series of numbers here, you take this series of numbers, put it
at the bottom, third column put it here, so on and so forth then you unroll
it, the size of the vector is going to be 3600.

So, the point is that a 60 × 60 image can be written as a vector of dimension 3600. This is
a huge number of dimensions if we think about it in terms of the number of dimensions that we
usually deal with in physics; in physics you are typically dealing with 3 dimensions, okay,
so length, breadth, height, okay, x1, x2, x3, the xyz coordinates, okay. So a 60 × 60 image
can be thought of as a vector which has 3600 coordinates; each pixel can be thought of as a
coordinate.

Now, this kind of representation is extremely useful, as we will see throughout the course,
and I will talk about it briefly at the end of this video also, okay. So, please do remember
this idea: you have the order of the tensor, which is simply the way you represent it, and
you also have the dimension; the dimension simply means the number of independent components
that you have in the vector, okay.

(Refer Slide Time: 14:53)

We can move on. We have matrices; this is a matrix because it has both a length and a
breadth, and I would typically call Aij a particular entry. If you want to think about
dimensions, you can unroll this too; you can unroll this into [1 4 2 5 3 6]^T, or in
alternate ways, and then you can think of this as a 6 dimensional vector, okay. Now tensors,
third and higher order tensors effectively: basically you will have, as I said earlier,
A_ijk, three indices effectively.

Now colour images, as I showed you with the India image earlier, have naturally got a tensor
representation, okay: number of pixels in each channel multiplied by number of channels,
okay. Video data, now that is interesting, because a video is a series of images, and each
image has, you know, Nx × Ny × (let us say 3); so each image is a 3 dimensional tensor, and
you can now think of the video as a series of images, each of which is a frame.

So, this now becomes a 4th order tensor, okay. So colour videos, for example, naturally fall
into 4th order tensors, and as usual, as you can imagine, they will have a huge number of
dimensions, because each image by itself has so many pixels.
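
A one-line check of this in NumPy, with a hypothetical 100-frame colour video of the same 600 × 538 size:

    import numpy as np

    video = np.zeros((100, 600, 538, 3))   # frames x height x width x channels
    print(video.ndim)    # 4: a 4th order tensor
    print(video.size)    # 96,840,000 numbers in total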

(Refer Slide Time: 16:44)

Implications of this kind of representation: I will go back to something I showed earlier;
this is an example of, let us say, pre-labelled data, okay. Now obviously I am showing this
figure to you in 2 dimensions, but now you can think further. Each of these crosses could
mean any number of things, okay. Suppose now, if you, you know, relax your imagination and
think about it, suppose this was a 3600 dimension image, okay; you have multiple dimensions,
and each of these crosses or each of these dots actually represented an image, okay.

So then you could think of this set of images being, let us say, cats, this set of images
being horses, and this set of images being, let us say, dogs, okay. What you would like is
some way in which similar images, in case you are doing a classification problem, land up at
similar places. This is an implication of our turning everything into a number: finally you
can represent it in a graph, and then you can start thinking about a classification problem
simply as if you are doing a graphical partition, okay.

Now, this need not only be an image; it could be sounds. Maybe people who speak, you know,
if I have a speech signature, maybe all of my speech signatures will land up in the same part
of the graph and somebody else's speech signature will land up somewhere else. It could be
words: if you could somehow turn every single word into a number, then maybe words with
similar meanings or similar implications or close relationships land up in the same part of
the graph. This is actually a profound implication, okay.

So, our point is that we are going to represent both vectors as well as transformations as
tensors; I will talk about that shortly. A transformation is an operation between, or a map
between, a vector and a vector. So let us say you have v1, say [1 2 3]^T, which is a 3 × 1
vector, and v2, which is a 5 × 1 vector. Now, if you want to find a map that goes from v1 to
v2, what is the most natural way to do it?

So, you can say something like v2 = W v1, where I choose some matrix W. If v1 is 3 × 1 and
v2 is 5 × 1, the natural way to go from one vector to another vector is to stick a matrix up
front, and what size should this matrix be? It should be 5 × 3, so that
(5 × 3) × (3 × 1) gives, you see, 5 × 1.

The point is, the natural transformation, or the natural mapping, between one vector and
another vector is to put a matrix up front. Once again, I would like you to think about the
image classification problem that we were looking at earlier: I had an image, something like
this, and this image was represented, remember, as a 3600 × 1 vector; my output is simply a
scalar, okay. How do you go from this to this? You put a matrix up front, okay.

So, if you put a 1 × 3600 matrix A up front, A v1 is v2. So the machine learning algorithm
has to somehow figure out this matrix, which will take every possible image and then turn it
into the right number, okay. Obviously what we do in practice is a little bit more
sophisticated, it is not this simple, but nonetheless this gives you an idea of what we are
going to do with machine learning: it is to try to find out what is this transformation which
is going to turn one vector into another vector.

As I said earlier, all these images, or all these dots, all these vectors that we have, could
be very high dimensional; even a small 60 × 60 image has to be represented as a 3600 × 1
vector, which means it has 3600 dimensions. So the implication is that you will need
algorithms that work very well in high dimensions; we will see that we need optimization
algorithms that work on very high dimensional data.

A very important implication, especially for engineering applications, is that this kind of
representation lets us go between images and numbers, okay: every image can be thought of as
a series of numbers, but it also means that a series of numbers can be thought of as an
image, okay.

So, we will actually intelligently use this later on in some applications, as we come to
vision algorithms.

So, in this video what we saw were two important things. One is the idea of turning any kind
of qualitative data you have into a vector of numbers; in particular, we saw some examples of
how to do this with images. The second thing we saw was simple notation for scalars, vectors
and tensors. The most important mathematical idea that I would like you to take away is that
of the dimension of a vector: it simply means the number of components. To uniquely represent
any image, you need a large number of components; therefore, an image can also be thought of
as a very high dimensional vector. Thank you.

Machine Learning for Engineering and Science Applications
Professor Dr. Balaji Srinivasan
Department of Mechanical Engineering
Indian Institute of Technology, Madras
Basic Operations

(Refer Slide Time: 0:16)

In the last video, we looked at basic notations for scalars, vectors and
tensors. In this video we will look at some, very basic operations that we
can do on scalars, vectors and tensors, you should be familiar with most of
this already, there are a couple of special operators, that we will be looking
at, but, most of you are familiar with this, even from high school.

(Refer Slide Time: 0:40)

So, the operations that we will be covering in this video are addition, and a special type of
addition called broadcasting. Then we will be looking at multiplication; within this we will
look at the matrix product, the dot product, and something called the Hadamard product, which
is just an elementwise multiplication. Finally, the matrix transpose and the inverse, okay.
So, I would suggest that you can skip this video if you already know this; this is very basic
material.

(Refer Slide Time: 1:10)

So, addition as you know: normal matrix addition is simply, you take one matrix and you add
the other matrix to it, elementwise. For this of course the sizes need to match; so if A is
m × n, B also has to be m × n, and C = A + B also has to be m × n. An example is shown on
your screen; this is basically the MATLAB output. So, you take some simple matrix A, in this
case [1 2 3; 4 5 6], add another matrix B, and it gives you the output C. You can see, for
any element, if you see the element in C which is 12, this is simply the corresponding
element in A added to the corresponding element in B.

Now, a special type of addition which we will be using within machine learning; this is
usually a programmatic thing rather than, you know, really mathematical. But in a program we
often add a matrix and a vector, and this gives back a matrix. So let us say A is an m × n
matrix, B is actually just a vector, and C also has to be m × n. What you do in this case
is: in case either the number of rows or the number of columns of B matches that of A, you
make multiple copies of the same vector and add it; this is called broadcasting. All it is,
is adding a vector to a matrix by simply repeating the vector so that it comes to the same
size as the original matrix.

Obviously, it can be done only if the vector that we are choosing has either the same number
of rows or the same number of columns, okay. This is automatically done in MATLAB and NumPy,
especially in the recent versions of MATLAB. NumPy is a Python library which we will be
looking at in the third week. So, this is done automatically in both of these. Just to show
you an example: let us say A is the same old [1 2 3; 4 5 6], and B is, well, in this case
the row matrix [1 1 1]. Then all you need to do is this: [1 1 1] gets added to the first
row, which gives you [2 3 4]; it also gets added to the second row, which is [4 5 6], and
gives you [5 6 7].

So, this happens automatically within MATLAB; we did not have anything special written here,
I just put A + B. Similarly, in NumPy also this kind of addition is done automatically; it
is assumed that if you have a size mismatch then you actually have broadcasting going on.
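
The same broadcasting example in NumPy (mirroring the MATLAB output described above):

    import numpy as np

    A = np.array([[1, 2, 3],
                  [4, 5, 6]])
    b = np.array([1, 1, 1])   # a row vector

    # NumPy repeats b across every row of A automatically
    print(A + b)
    # [[2 3 4]
    #  [5 6 7]]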

(Refer Slide Time: 4:14)

Multiplication: so of course, all of us are familiar with the matrix product. This also
needs a size match. You know that the first element of the product comes from a dot product,
essentially of the first row with the first column. Similarly, for the ij-th element, the
i-th row and the j-th column multiplied together give me the ij element. Mathematically, it
can be written in this way: Cij = Σ_k Aik Bkj. We know that, as usual, if you have m × n and
n × p, then C's size is m × p, okay.

Sizes of course must match, in the sense that the number of columns of the first matrix must
equal the number of rows of the second, okay. You can also have, as usual, a matrix
multiplying a column vector, okay; we have seen one such example here. You also have
something called the Hadamard product. This is simply an elementwise multiplication. All we
have to do is to say that if A is m × n and B is m × n, then C is also m × n, and all you
are saying is that Cij is equal to Aij multiplied by Bij; no summation, unlike here, okay.
You just take the corresponding element here, take one element there, multiply them, and it
gives you the corresponding element of the product; this is analogous to what we did with
addition, okay.

Why is it that the normal matrix product is defined in this seemingly weird way? It turns
out that it has several advantages, linear algebra wise; we do not have the time to go
through that in this course. But it turns out that, more often than not, when products occur
they usually occur as matrix products rather than as Hadamard products. Nonetheless, in deep
learning and in machine learning we will encounter this kind of elementwise multiplication
multiple times, which is why it is being written out as a separate thing.

Just as an example, as flashed on your screen here, we have taken the same old A and again
the same B, and you have the corresponding multiplication: for example, the (2,2) element of
A here is 5, the (2,2) element in B is 7, and the (2,2) element in C is 35.
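
In NumPy the two products are written differently; here B is chosen to match the example above (its (2,2) element is 7):

    import numpy as np

    A = np.array([[1, 2, 3],
                  [4, 5, 6]])
    B = np.array([[2, 3, 4],
                  [6, 7, 8]])

    print(A * B)      # Hadamard (elementwise) product; shapes must be equal
    # the (2,2) entry is 5 * 7 = 35, as in the example above
    print(A @ B.T)    # matrix product; inner dimensions must match (2x3 by 3x2)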

(Refer Slide Time: 7:12)

Finally, we have the vector product. We are not going to look at cross products, really, as
far as this course is concerned; that comes up more in physics. In machine learning, we
usually deal with dot products. A dot product, you remember, is simply the product between
two vectors which gives you a scalar; this dot product simply is
a1 b1 + a2 b2 + · · · + an bn, where of course the assumption is that a and b are of the same
size.

What is more important here is, you know, one example is given there: if a is [1 2 3]^T and
b is [4 5 6]^T, then α = 4 + 10 + 18 = 32. Now, more importantly, as far as machine learning
is concerned, we can usually also write it in matrix notation rather than vector notation.
Since we will deal a lot with matrices, it is sometimes useful to see this as matrix
notation.

So, in this case, I can write a, really speaking, as a column; vectors should always be
represented as columns. So you write one of these as a column vector and one of these as a
row vector; this would be a^T b, and that also gives you the same product using the matrix
product rule, (1 × 4) + (2 × 5) + (3 × 6). So the dot product can be written as a^T b. So,
a · b can be written as a^T b.

Of course, since α is a scalar, if you take the transpose of this whole thing you can also
write it as b^T a. So if I had written it as [4 5 6] · [1 2 3]^T, it would not have made a
difference, because I am still making the same product. So we will be using this kind of
notation repeatedly, okay. So if I do a^T b: transpose in MATLAB is represented by a prime,
okay, so a' here denotes a^T, and obviously that is going to give you the same result, okay.

So, this kind of thing we will be using extremely often, so please get comfortable with
this: denoting a dot product between two vectors as a^T b or b^T a, and also as a · b. All
these three will be used interchangeably. I am going to write this again: a · b, in case a
and b are vectors, is the same as a^T b, which is the same as b^T a.
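
The same three notations in NumPy (where transpose is .T rather than MATLAB's prime):

    import numpy as np

    a = np.array([1, 2, 3])
    b = np.array([4, 5, 6])

    print(np.dot(a, b))   # 32, the dot product a . b
    print(a @ b)          # 32, the same thing

    # With explicit column vectors, a^T b reproduces the matrix notation
    ac, bc = a.reshape(3, 1), b.reshape(3, 1)
    print((ac.T @ bc).item())   # 32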

(Refer Slide Time: 10:15)

A couple of other operators that we will be looking at: first, of course, the transpose. The
transpose, as all of you know, is simply written as A^T. Mathematically, all you do is take a
mirror image, sort of across the diagonal, in case it is a square matrix. So Bij will be
Aji, okay. So if you have the matrix [1 2 3; 4 5 6], the same matrix we have been using, it
gets flipped: if A has size 2 × 3, B becomes a 3 × 2 matrix and the elements flip.

The inverse is a matrix which, when multiplied by the original matrix, gives you back the
identity, whether you left multiply it or right multiply it. We are going to assume here that
A is square, so if A is of size n × n, A inverse is also of size n × n, and when you multiply
A and A^-1 you are going to recover I. I is of course the identity matrix, which has 1s on
the diagonal and 0s everywhere else (on the slide I just represent all the off-diagonal
entries by a big 0); so you are going to get an n × n identity matrix, okay.

Now, it is important to know that not all square matrices have a valid inverse. It is also
possible to find inverses in the case of non-square matrices; in such cases we use something
called the pseudoinverse, which we will be looking at later on in the course.

(Refer Slide Time: 12:08)

So, just as an example of an inverse: if I take a random 3 × 3 matrix and take its inverse,
and I multiply the matrix by its inverse, you recover the identity matrix. One thing I would
like to point out is, you will notice that the 0s are not quite 0; you have only some decimal
places of accuracy. This is something very important; this is called finite precision, which
means, just like in your calculator, you have only a certain number of digits that you can
represent accurately. Depending on the calculator you might have 8 or 10 places, okay, and
certain other digits are actually uncertain; you are not sure what happens after the eighth
digit.

In many cases, the result that the computer gives will actually be accurate only up to a
certain number of digits, typically 16 digits, okay, if you have what is called double
precision. We will be looking at the implications of this later on, but you can kind of see
it here: some of the off-diagonal terms it calls positive and some of them it calls negative,
because, you know, maybe the 10th or 11th digit is 1 or 2 or something of that sort, okay.
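
A sketch reproducing this experiment in NumPy:

    import numpy as np

    A = np.random.rand(3, 3)          # a random 3x3 matrix
    I = A @ np.linalg.inv(A)

    print(I)                          # off-diagonal entries are ~1e-16, not 0:
                                      # the finite (double) precision effect
    print(np.allclose(I, np.eye(3)))  # True: equal up to round-off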

So, in this video, what we looked at were some basic simple operators. We looked at addition
and, more importantly, broadcasting, which is an extension of addition; you would implement
it by overloading the operator. We also saw elementwise multiplication, and the other
operators are things that you should have already been familiar with. Thank you.

Machine Learning for Engineering and Science Applications
Professor Dr. Balaji Srinivasan
Department of Mechanical Engineering
 
Indian Institute of Technology, Madras
Norms

(Refer Slide Time: 0:15)

In this video we will be looking at an important idea, the idea of norms; this is one idea
that we will be using throughout the rest of this course.

(Refer Slide Time: 0:28)

So norms are an idea in linear algebra, or in general whenever we deal with tensorial
quantities. The basic reason why machine learning and many other fields use norms is that we
usually use vectors or matrices as our basic units of representation. As we saw in the last
video, we tend to use vectors and matrices very often, basically because that is what we use
in order to measure or represent images, sounds, in fact anything that is our input or
output.

So there are two basic reasons that we use norms. One is to find out how big or small a
particular vector or tensor is; sometimes we need to estimate the size of something. Now for
a scalar, like a weight or pressure or temperature, there is one single number by which we
can get the idea of how big this thing is; whether it is negative or positive, the absolute
value usually denotes what the size is for a scalar.

For a vector we have no such single number; of course a vector is a bunch of numbers, but
suppose you need a single number. So norms can be thought of as a mapping from a vector or a
tensor to a single number, a scalar, and actually a positive scalar. We will see how to do
that in the rest of this video; there is another reason for which we use norms, which we will
come to shortly.

(Refer Slide Time: 2:18)

So for example, let us say you have a vector of this sort, say (3, 4); usually we will
denote the size or the length of this vector as √(3² + 4²) = 5. So the usual notion of
length, a norm, is denoted by this sign, usually ||·||; just like for scalars we use |·| for
absolute values, for norms we tend to use ||·||. Some people use |·| also, so we will see
this notation a little bit later on in the video.

So whenever you hear me say norms, please think of, you know, a simple vector for which you
are trying to find out the length. Essentially you are trying to find out one single number
that will represent the size, or how big a particular vector is. There is another reason for
which we use norms, which is to try and estimate how close one vector or tensor is to
another. So once again I would like you to think about the idea of images, where something
qualitative is represented as a vector and you want to estimate this closeness.

So please remember if you recall what we did in the previous videos, we had looked at a
whole image. So let us say you have an image of a cat or something and this is a 60 × 60
image, we saw that this can be unrolled into a single vector which is of size 3600, each of
these represents one pixel, okay. So you have 3600 pixels, so it can be written as a vector of
dimension 3600.

So now you cannot really imagine this but let us assume that instead of this, this is just two
numbers so it is as if it is an image of just two pixels, but suppose you have a one whole
image of 3600 pixels, you can now imagine this is one image and this is another image, of
course we are representing it in two dimensional space, so each of these points is a vector
which represents one image and suppose you want to find out is this image close to the other
image, okay now how would you do that?

So that idea also basically would be: how big is the difference between these two vectors?
We know of course that the difference between two vectors is another vector. So if you have
this vector v1 and this vector v2, v1 − v2 is another vector, and I could form
δv = v1 − v2. If I find out the norm ||δv|| = ||v1 − v2||, the length of this vector which
is the difference of the two vectors, that will tell me how close the two images are.

So a norm is supposed to represent both these ideas, or at least it is used for both these
ideas: if you can somehow define one single number to represent the size of one whole vector
or one whole tensor, then you have the idea of a norm. So, like I said just now, you can try
to find out how close one sound is to another if you have two representations, how close one
word is to another, how close one image is to another, provided all of this can be
represented as vectors and you can find out the norm of the difference between the two
vectors.
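
For example, in NumPy, with two hypothetical images already unrolled into vectors:

    import numpy as np

    v1 = np.random.rand(3600)   # one "image" as a vector
    v2 = np.random.rand(3600)   # another "image" as a vector

    # One single number summarising how far apart the two images are
    print(np.linalg.norm(v1 - v2))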

So now let us see how to go about doing this. The norm is actually a generalization, as you
can probably figure out, of the notion of length: the idea that we have of the length or
size of simple scalars can now be extended to vectors, matrices and tensors.

(Refer Slide Time: 6:10)

So let us say you have a vector; all the examples which I show on the slide will be in 2D,
but of course you can imagine this being extended to multiple dimensions. The numerical
example I will be taking would be that of a 3D vector.

Mathematically, a norm is any function f that satisfies

● f(x) = 0 ⇒ x = 0
● f(x + y) ≤ f(x) + f(y) (Triangle Inequality)
● ∀ α ∈ ℜ, f(αx) = |α| f(x) (Linearity)

So the first notion, which is very important, is: if you have a vector whose length is 0,
then that means it is the 0 vector. So the only vector which is of length 0 is essentially
the vector which is right at the origin, okay; that is the first property that any norm
should satisfy. If the vector has length 0, then it must be the 0 vector, okay. So this is
the definition of norm that we will be using here.

The second property is the triangle inequality. So let us say you have two vectors; please
notice I have flipped the arrow here just in order to be consistent with the mathematics that
I will be using. So let us say the first vector is x and the second vector is y, okay. Now
we know that x + y has to be this vector here, going from here to here, by the simple vector
addition rules. Now what the triangle inequality rule for the norm says is that the length of
(x + y) always has to be less than or equal to the length of x plus the length of y. We know
this from the normal triangle inequality that we use for triangles right from school: the sum
of two sides is always going to be larger than the third side, and that is because the
shortest distance between any two points is a straight line. So if I want to go from here to
here, you know, if I go that way it will always be longer than this; so this is the normal
triangle inequality rule. It is represented with f, where you can think of f as the function
which represents a norm: f(x + y) ≤ f(x) + f(y); the norm of the sum of two vectors is going
to be less than or equal to the sum of the norms of the individual vectors. It is a very
important property.

The third property that a norm satisfies is that of linearity. What it means is: if I take a
vector and simply scale it up, like taking a string and extending it to two times its
length, each of the coordinates will increase by a factor of 2. So if I scale the vector by
a factor of α, its length increases by a factor of |α|. These are the three properties that
any norm satisfies.

(Refer Slide Time: 9:00)

Now based on these three properties that we just saw, the idea of 0, the idea of triangle
inequality and the idea of linearity, what we can do is derive many, many different functions
that satisfy them. So remember, f(v) → scalar (+ve); f, the norm, takes in a vector and
gives a scalar which is positive, and you can define many functions which satisfy these
three properties.

So let us take a simple example; we are taking the example of the vector [−5, 3, 2]^T, so
let us say we have a 3 dimensional vector and we will see various norms that can be used for
this simple vector. The first and the most obvious norm is called the Euclidean norm,
sometimes the Pythagorean norm. The Euclidean norm

||v||2 = (v1² + v2² + · · · + vn²)^(1/2)

you will notice has a subscript 2; the reason for the subscript will become obvious very
shortly. So you have a vector, and all it is, is the root of the sum of squares; in this
case you would do √((−5)² + 3² + 2²), essentially what we usually call the length of the
vector. This is also called the 2-norm, or sometimes also the L2-norm; the reason for the L
we will not go over, but you will see this term being used a lot of times: 2-norm or
L2-norm.

So what is the L2-norm in this case? It usually corresponds to our notion of distance, so
you can immediately find out that this is equal to approximately 6.16. A similar norm is
called the 1-norm

||v||1 = |v1| + |v2| + · · · + |vn|

please notice the subscript here. All it is: instead of squaring and taking the square root,
you simply add the absolute values. So in this case our 1-norm, very obviously (I have
written a MATLAB command norm(v, 1) here, but you can do it by hand), is
|−5| + |3| + |2| = 10.

Now using these two, you can generalize to the idea of what is called a p-norm; the p-norm
is

||v||p = (|v1|^p + |v2|^p + · · · + |vn|^p)^(1/p)

So you will notice that this covers both the 1-norm and the 2-norm, and this kind of
definition is valid for p ≥ 1.

So usually you cannot define, let us say, a half norm or something, but for p of 1 and above
you can define all the other norms. As it turns out, the L2 and L1 norms are extremely
useful norms. There is also a third norm which is very useful, which is called the ∞-norm or
sometimes the max-norm. The max-norm simply is

||v||∞ = max(|v1|, |v2|, · · · , |vn|)

find out the maximum component in absolute value; so in our case max(|−5|, 3, 2) = 5. You
can check that MATLAB has a command for the max-norm: norm(v, inf) gives you a maximum of 5.
Now what is interesting is that you can actually see the max-norm as a limit of the p-norm
as you keep on increasing p. As you keep on increasing p, let us say the v2 component was the
largest component; what will happen is all the other terms will become very small in
comparison to v2^p as you keep on increasing the power, v2^p will be very large as p becomes
large, and in the limit of infinity this is the only term that survives; once you take the
1/p-th power, what survives is the maximum norm. So this is either called the infinity-norm
or the maximum-norm.
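
The three norms of the example vector, computed in NumPy (the MATLAB commands in the slide are norm(v, 1) and norm(v, inf)):

    import numpy as np

    v = np.array([-5, 3, 2])

    print(np.linalg.norm(v, 2))        # 2-norm: sqrt(25 + 9 + 4) ~ 6.16
    print(np.linalg.norm(v, 1))        # 1-norm: 5 + 3 + 2 = 10
    print(np.linalg.norm(v, np.inf))   # max-norm: 5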

Now I want to emphasize that the most natural norm, at least the one that we think of very
naturally, is the 2-norm; nonetheless the 1-norm or the infinity-norm can also be useful.
Please notice that all of these norms satisfy the three properties; we are not going to
prove this, and we know that the Euclidean norm satisfies them by intuition. Just as a quick
check, for example, you can check that the infinity-norm is definitely going to satisfy the
first property: the only way in which the infinity-norm can be 0, that is, the maximum of
the absolute values can be 0, is if all the components are exactly 0.

Similarly, if the sum of absolute values is equal to 0, the only way that is possible is if
each of the individual components is 0. So these three properties are satisfied by all of
these three norms. Now all these norms, as I have shown them, apply to normal vectors; you
can actually extend this idea to matrices also. The idea of norm is true for vectors,
matrices and tensors; the definition, or at least the properties, remain the same, with x,
instead of being a vector, becoming a matrix.

You also have the 1-norm, 2-norm and infinity-norm for a matrix, but in machine learning the
most common norm that we use is what is called the Frobenius norm. The Frobenius norm

||A||_F = (Σ_{i,j} Aij²)^(1/2)

is very similar to the Euclidean norm; all it is, is you take all the components of a
matrix. So let us say I have the matrix [1 2; 2 0] here; the Frobenius norm of the matrix
is √(1² + 2² + 2² + 0²), basically the sum of the squares, and you take the square root,
okay; that is the Frobenius norm, in this case √9 = 3. So that is the Frobenius norm.

Please notice that the Frobenius norm, denoted by ||A||_F, is not the same as the matrix
2-norm; there is such a thing as the matrix 2-norm, or the matrix L2-norm, and that is not
the same as the Frobenius norm. So there is a slight difference there; nonetheless the
Frobenius norm is probably, once again, the most common thing that you will think of
immediately if you want one number that represents the size of the matrix.
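
In NumPy the Frobenius norm of the example matrix is:

    import numpy as np

    A = np.array([[1, 2],
                  [2, 0]])

    print(np.linalg.norm(A, 'fro'))   # sqrt(1 + 4 + 4 + 0) = 3.0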

So this is the idea of the norm; we will be using this repeatedly through the rest of the
course. One of the main uses we will put it to is in iterative procedures for a vector:
suppose you are trying to find out some particular parameter vector, or some particular
image, and you are trying to get there slowly through an iterative process; your initial
guess is bad and you are slowly getting there. You want to find out how close each guess is
to the final answer, and one way to find out is, as we saw earlier, to find the difference
between the two and take its norm. So we will be using this repeatedly through the rest of
the course, thank you.

Machine Learning for Engineering and Science Applications
Professor Dr. Balaji Srinivasan
Department of Mechanical Engineering
Indian Institute of Technology, Madras
Linear Combinations, Span, Linear Independence

In this video, we will be looking at three fundamental ideas in linear algebra: the idea of
linear combinations, span, and linear independence. The idea of linear combinations we will
use multiple times through the rest of the course. You can even think of these as simple
definitions, but they are very powerful ideas when you do a full linear algebra course, or
if you have done a full linear algebra course. In fact, a lot of the power of linear algebra
comes from these three ideas.

The idea of the linear combination is simple. Let us say you have a set of vectors
{v^(1), v^(2), ..., v^(n)}. Remember I have shown these in bold, which means each of them is
a vector. So suppose v^(1) through v^(n) is a set of vectors; what you can get by combining
each one of them through a linear combination is simply some coefficient multiplying this,
plus some other coefficient multiplying that. So some scalar coefficient multiplying each of
these vectors and adding them up is called a linear combination. It is a very intuitive kind
of definition. So mathematically you would write it as α1 v^(1) + α2 v^(2), so on and so
forth, up till αn v^(n), where α1, α2, etc. are different scalars.

So let us take a simple example: let us say v1 is the vector [1, 2, 3]^T and v2 is the
vector [2, 0, 3]^T; then v1 + 2v2, for example, is a linear combination. In this case, we
get [5, 2, 9]^T.

Now an interesting way of thinking about this linear combination, remember v3 was v1 + 2v2,
is to write it as a product. So we can write it as [v1 v2] multiplying [1, 2]^T, so that you
get v1 + 2v2. So in matrix notation, you write the matrix whose first column is v1 and whose
second column is v2. Now if you multiply this matrix by [1, 2]^T, you will notice that the
first element is 1 + 2 * 2 = 5, the second is 2 + 0 * 2 = 2, and the third is 3 + 3 * 2,
which is 9.
So essentially v1 + 2v2 can be thought of as a linear combination of two columns, v1 as the first column and v2 as the second column. This is a tremendously useful way of thinking of things: matrix multiplication, when you take one matrix and multiply it by another matrix, can actually be thought of as taking linear combinations of columns. For example, take the same matrix again, with first column (1, 2, 3)^T and second column (2, 0, 3)^T, and multiply it by the 2 ∗ 2 matrix with columns (1, 2)^T and (3, 4)^T. The first column of the product, (5, 2, 9)^T, is exactly what we had here. Why? Because it is the linear combination v1 + 2v2. Now what will be the second column of the product? That column is essentially 3v1 + 4v2. So each of the columns of the result of a matrix multiplication can be thought of as some particular linear combination of the columns of the matrix we are multiplying here. We will utilize this idea when we come to the idea of invertibility, etc.
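As a quick sanity check of this column picture, here is a small NumPy sketch (an added illustration, not part of the lecture):

import numpy as np

v1 = np.array([1, 2, 3])
v2 = np.array([2, 0, 3])
A = np.column_stack([v1, v2])     # 3 x 2 matrix with columns v1 and v2

B = np.array([[1, 3],
              [2, 4]])            # coefficients of the two linear combinations

C = A @ B
print(C[:, 0], v1 + 2 * v2)       # both are [5 2 9]
print(C[:, 1], 3 * v1 + 4 * v2)   # both are [11 6 21]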

The second idea that we want to discuss in this video is the idea of the span; it is a natural outgrowth of the idea of linear combination. The span of a vector or a set of vectors is whatever you can get by every possible linear combination. Remember, in the last slide we had one particular linear combination, v1 + 2v2; but suppose you leave α1 and α2 free and try to find every possible linear combination: that set is called the span.
This should become a little clearer if we look at an example. So let us say we are looking at two vectors, v1 = (1, 0) and v2 = (0, 1). What would be the span of these? Let us just look at it geometrically, and then we can quickly see what happens mathematically. Notice that v1 is simply the unit vector in the X direction, and v2 is the unit vector in the Y direction. Take any vector, say (α1, α2). All (α1, α2) is, is α1 v1 + α2 v2, because α1 ∗ (1, 0) is going to be (α1, 0) and α2 ∗ (0, 1) is going to be (0, α2). So you can add these two and get any vector. What this means is that the span of the coordinate vectors is the whole of 2-dimensional space, the whole of ℝ^2: any vector that I choose can always be written as a linear combination of these two vectors. So the span of these two vectors is going to be the whole of the coordinate space ℝ^2.
Similarly, you can think of multiple such examples: for 3D, if you define the three vectors (0, 0, 1), (0, 1, 0) and (1, 0, 0), their span will be the whole of 3D space. The span of all columns of a matrix is called the column space. So if I have a matrix, once again taking the same example as before, all the vectors of the form α1 (1, 2, 3) + α2 (2, 0, 3) for all α1 and α2 form the span of these two columns.

Now notice that if I have an equation Ax = b, what it means is that I have some matrix A and some vector x, and I am obtaining some other vector b. So suppose I give you an A and a b and ask you to find x: if this equation has a solution, it automatically means that b has to be in the column space of A. Why is that? We just saw this in the previous slide: the vector Ax is simply the linear combination of the first column multiplied by x1, the second column multiplied by x2, and so on up to the nth column multiplied by xn. So if a solution exists, it automatically means that b is in the column space of A.
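To illustrate, here is a small NumPy check (an added sketch) that b = (5, 2, 9) lies in the column space of the matrix above, by solving the least-squares problem and inspecting the residual:

import numpy as np

A = np.array([[1, 2],
              [2, 0],
              [3, 3]])                        # columns are v1 and v2 from before
b = np.array([5, 2, 9])                       # b = v1 + 2*v2

x, residual, rank, _ = np.linalg.lstsq(A, b, rcond=None)
print(x)                                      # approximately [1. 2.]
print(residual)                               # (near-)zero: b is in the column space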

The final idea in this video is that of linear independence. A set of vectors is defined to be linearly independent if none of the vectors can be written as a linear combination of the others. These three ideas of linear combination, span, and linear independence are actually very deeply related; unfortunately we will not have the time to go through all the interrelations between them, so you can treat them as three related ideas. I think some of you might automatically see the connections between them.
So all we are looking at is this: if I take the two vectors (1, 0) and (0, 1), you cannot write v2 as any linear multiple of v1, so they are linearly independent. Now suppose I take the three vectors (1, 0), (0, 1) and (3, 4). These are not linearly independent; I would call them linearly dependent. Why is that? Because v3 = 3v1 + 4v2, so since v3 can be written as a linear combination of the other two vectors, these three vectors are not linearly independent.

So mathematically we say that a set v1 through vk is linearly independent if and only if the equation α1 v1 + α2 v2 + ... + αk vk = 0 has only one solution. Notice that if I set α1, α2, ..., αk all equal to 0, obviously I am going to get 0; linear independence means that should be the only solution to this system of equations. Why is that? Notice here, if v3 = 3v1 + 4v2, it automatically means v3 − 3v1 − 4v2 = 0, which in this form I can write as −3v1 − 4v2 + v3 = 0. So for v1, v2, v3 I have found a linear combination, α1 = −3, α2 = −4, α3 = 1, which gives 0 without all the αi being 0; if that is the case, then the set of vectors is linearly dependent. And if the only solution is α1 through αk all equal to 0, then that means the set of vectors is linearly independent, thank you.
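A quick numerical way to test linear independence (an added sketch, not from the lecture) is to stack the vectors as columns and check the rank:

import numpy as np

V = np.array([[1, 0, 3],
              [0, 1, 4]])                    # columns: (1,0), (0,1), (3,4)
print(np.linalg.matrix_rank(V))              # 2 < 3 columns: linearly dependent

W = np.array([[1, 0],
              [0, 1]])                       # columns: (1,0), (0,1)
print(np.linalg.matrix_rank(W))              # 2 == 2 columns: independent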

Machine Learning for Engineering and Science Applications
Professor Dr. Balaji Srinivasan

January 25, 2019

Department of Mechanical Engineering, Indian Institute of Technology, Madras

Matrix Operations, Special Matrices, Matrix Decompositions

Refer slide time: 0:15

In this video we will be looking at the final piece of the linear algebra portion of this course. Specifically, we are going to look at matrix operations, some special types of matrices, and matrix decompositions; within matrix decompositions we will be looking at the eigendecomposition. All of these are ideas that you should already be familiar with; please remember this is just a recapitulation of the kind of things that you need to know for this course.
Refer Slide time: 1:00

If time permits we will look at deeper physical interpretations of this, but by itself linear algebra is a vast subject. So why are we looking at matrix decomposition, also known as matrix factorization? To remind you, in previous videos we had seen that matrices transform one vector to another: if you pre-multiply a vector by a matrix, you get another vector. This has physical meanings which we will look at as we go on through this lecture itself.

Typically, as you remember, we deal with very high dimensional vectors and tensors within machine learning. You might recall that if you have a 60 ∗ 60 grayscale image, you can interpret it as a single 3600-dimensional vector, one vector with 3600 components: pixel 1, pixel 2, up till pixel 3600. These are just examples of the size of vectors that you will be dealing with, which means we are actually dealing with very large matrices. If we have to convert an n ∗ 1 vector into another n ∗ 1 vector, that is, if a vector v of size n ∗ 1 has to go to another vector w, also n ∗ 1, you have to pre-multiply by a matrix A which is n ∗ n. So if n is 3600, then A is a 3600 ∗ 3600 matrix. Now it is usually useful to understand what these components mean, and as it turns out, in its original form a matrix is kind of hard to understand. Just as for a number, say 91, we would factorize it as 91 = 13 ∗ 7, where both factors are prime and cannot be factorized further.
Similarly, it is useful to factorize a matrix itself, and you can think of an eigendecomposition, or the other decompositions that we will be talking about, as simply decomposing one big thing into smaller things which we can understand a little better. It is also sometimes useful for a large matrix to be summarized by one, two, or a few numbers rather than a large set of numbers. We have seen norms: for a matrix we will often use, at least within this course, the Frobenius norm. A norm reduces all these n ∗ n entries to a single number, so it is a mapping from an m ∗ n matrix to a single number.

Another such measure is the trace. All of these are small summaries; obviously they do not capture the whole matrix. The determinant, which you will be familiar with, eigenvalues, singular values, etc. are similar numbers which try to encapsulate some idea of what the matrix represents, as we will see later in the slides.
Refer slide time: 5:06

The first idea we are going to look at is what is called the trace of a matrix. It is simple: the trace of a matrix is the sum of its diagonal elements, tr(A) = Σ_i A_ii. So if you have the matrix A = [1 4 5; 7 2 6; 8 9 3] (rows separated by semicolons), then the trace of A is 1 + 2 + 3, which is 6. The idea of the trace can be used for non-square matrices also: say the matrix is taller, for example [1 4 5; 7 2 6; 8 9 3; 10 11 12; 1 2 3]; you would still look only at the diagonal entries A11, A22, A33, and the trace would still be 6. Typically, however, we will be using the trace for square matrices.

The trace has certain properties: tr(A + B) = tr(A) + tr(B), and tr(AB) = tr(BA); even if AB is not equal to BA, the trace does not change when you commute the matrix product. Similarly tr(A) = tr(A^T), which follows directly from the definition. These are some useful properties.
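These properties are easy to verify numerically; here is a small NumPy sketch (added for illustration):

import numpy as np

A = np.array([[1, 4, 5],
              [7, 2, 6],
              [8, 9, 3]])
B = np.arange(9).reshape(3, 3)

print(np.trace(A))                            # 1 + 2 + 3 = 6
print(np.trace(A @ B), np.trace(B @ A))       # equal, even though AB != BA
print(np.trace(A) == np.trace(A.T))           # True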
Refer slide time: 5:26

The next idea is that of the determinant of a matrix. Again, you will be very familiar with this; I just want to make the notation a little clearer, nothing much else in this slide. We all know that if you take a 2 ∗ 2 matrix, you can simply define the determinant as a11 a22 − a21 a12, and that the determinant of a bigger matrix is defined by a kind of recursion: if we call a_ij the submatrix obtained by deleting row i and column j of A, then the determinant of A is defined as a summation over a row or a column (we can do it either way, as you know), det(A) = Σ_j (−1)^(i+j) A_ij det(a_ij). Again, this is something very familiar to you from school.

Now, more importantly, the determinant actually represents a volume. Interpret the first column (a11, a21, ..., an1)^T as a vector v1, the second column as v2, and so on up to vn for a square matrix. Let us take a simple case: if I have the matrix [1 3; 2 4], I can think of it as two vectors, the vector v1 = (1, 2)^T and the vector v2 = (3, 4)^T; in that case the determinant represents the area of the parallelogram formed by v1 and v2. Similarly, you can extend this to higher dimensions: if you have 3 vectors it will be the volume of the parallelepiped formed by those 3 vectors, and for 4, 5, 6, ... vectors you can interpret the determinant as an n-dimensional volume. This has very interesting consequences, as we will see shortly.
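Here is a small NumPy illustration (an added sketch) of the determinant as a signed area:

import numpy as np

A = np.array([[1, 3],
              [2, 4]])              # columns v1 = (1, 2) and v2 = (3, 4)
print(np.linalg.det(A))             # -2.0; |det| = 2 is the parallelogram area

B = np.array([[1, 2],
              [2, 4]])              # second column is twice the first
print(np.linalg.det(B))             # 0.0: dependent columns enclose zero area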
Refer slide time: 8:26

One consequence is the invertibility of a matrix. As you know, A^(-1) is defined only if the determinant of A is non-zero: A has a unique inverse if and only if det(A), which we will sometimes denote |A|, is not equal to 0. Now this automatically means that the columns of A have to be linearly independent. How does this follow? Using the same notation as before, call the first column (a11, a21, ..., an1)^T the vector v1, the second column v2, and so on up to the vector vn, whose last entry is ann. Suppose one of these columns, say vn, could be written as a linear combination of the others, say α1 v1 + α2 v2 + ... + α(n−1) v(n−1). What does this mean? By simply doing the column operation "nth column goes to nth column minus this combination", you will get a column of all 0's. This transformation, as you know, preserves the determinant, which means the determinant must be 0.
This also has a nice physical interpretation. If one of these vectors can be represented as a linear combination of the other vectors, the volume of the parallelogram or parallelepiped formed by these vectors becomes 0. You can see this easily in the 2D case or even in the 3D case. Say you have two vectors, so you only have a 2 ∗ 2 matrix, and one of them is linearly dependent on the other, let us say v2 = 2v1. Then the area formed by these two vectors is simply going to be 0; you get a non-zero area only if one of them is not simply a scaling of the other. Similarly, if you have three vectors and one of them is a linear combination of the other two, it means all three lie in the same plane, which means they are not going to enclose a non-zero volume. So there are multiple interpretations of when A^(-1) exists, and there are deep connections with the determinant of the matrix.
Refer slide time: 11:10

We will now look at some special matrices and vectors; again, this should be familiar to you. The first idea is that of a diagonal matrix: a diagonal matrix is one where only the diagonal entries can be non-zero, and all off-diagonal entries are 0. Mathematically, if D is the matrix, D_ij = 0 if i ≠ j. A symmetric matrix is a matrix which is symmetric across the diagonal; another way to say it is that the matrix is equal to its own transpose.

A unit vector, as all of us are familiar, is a vector of unit length. In our notation, remember, we used the idea of a norm for length, so a unit vector has norm equal to 1. Which norm? Typically, when we say unit vector, we mean the 2-norm; please remember the 2-norm or L2 norm is simply the square root of v1² + v2² + ... + vn². You can define a unit vector with respect to other norms as well, but it is usually the 2-norm that we use.


Another useful idea is that of orthogonal vectors; it simply means vectors that are mutually perpendicular. If x and y are mutually orthogonal, then x · y is 0, which, remember, can be written in matrix form as x^T y = 0. We also have the idea of an orthonormal set of vectors, where you have unit vectors that are perpendicular to each other.

An orthogonal matrix is a matrix whose transpose and inverse are the same thing, which means A^T = A^(-1); the simplest sort of orthogonal matrix is the identity matrix. It has some nice properties which we will discuss very shortly, but a simple thing that follows from the definition is that A^T A = A A^T = I, which also means that all columns are orthonormal. Remember, if A^T = A^(-1), then when you multiply the matrix by its transpose, each entry is the dot product of two columns, and these must be 0 whenever the two columns are not the same (and 1 when they are). Now, where do we use orthogonal matrices? Even though such a matrix has orthonormal column vectors, we still call it an orthogonal matrix. An orthogonal matrix can always be thought of as a rotation operation: if I have a vector and I pre-multiply it by something that simply rotates it without changing its length, the matrix used is always orthogonal. This can be proved, though we will not have time to show it; but please do remember that whenever you see an orthogonal matrix, you should think of a rotation matrix. That is another way to think about it.
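As an added illustration, here is a 2D rotation matrix checked numerically:

import numpy as np

theta = 0.7                                       # an arbitrary angle in radians
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # rotation by theta

print(np.allclose(Q.T @ Q, np.eye(2)))            # True: Q^T = Q^(-1)
v = np.array([3.0, 4.0])
print(np.linalg.norm(v), np.linalg.norm(Q @ v))   # both 5.0: length is preserved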
Refer slide time: 15:08

So let us now come to the matrix factorization that I was talking about: the eigendecomposition. This is typically very useful for square symmetric matrices, especially square symmetric real matrices, even though you can use it for other matrices as well, and I am sure you would have done so before. As far as this course is concerned, we will primarily be using it for square symmetric matrices; when we have square symmetric matrices, we are guaranteed several nice properties as far as the eigendecomposition is concerned.

So here is the simple physical picture that is usually useful in order for you to anchor yourself in the eigendecomposition. Remember, I have talked about this before too: if you have a matrix A (for now I will talk only about square matrices) and it pre-multiplies a vector v, it results in some other vector w. You can think of A as a machine or an operator acting on v and giving you w: it takes v to w. Now, through physics as well as intuitively, you can see that there are only two things that this matrix A can do to v. It can rotate it, that is, turn it through an angle, whether in 2D, 3D, or any space you can think of; and the other thing it can do is change its length. So the length of v might not be the same as the length of w: the matrix can stretch the vector, rotate it, or rotate and stretch it. These are the two operations that any matrix can do as far as acting on another vector is concerned.

Now this is extremely useful. If you think of every such operation as a matrix, then there are special vectors for which the only thing the matrix does is stretch them. Eigenvectors are those special vectors: given a matrix A, there is a set of special vectors v, the eigenvectors, which only stretch under the action of this matrix. What is an eigenvalue? The eigenvalue is the factor by which this vector stretches. So mathematically I would write Av = λv: the result is a new vector w, but this w is not rotated, it is only stretched. So the eigendecomposition identifies, in some sense, that set of vectors which only stretch under the action of the matrix A. This is the physical interpretation.

Refer slide time: 18:20

So we will now write the eigendecomposition. You can think of it this way: in some sense, the eigenvectors are essentially the coordinate system in which the matrix A looks the nicest; it looks diagonal. This is one way of looking at it; if you do not understand this fact, that is okay, I will just write the mathematical expression right now. So let us say A has n linearly independent eigenvectors; that is, A is an n ∗ n matrix and it has n linearly independent eigenvectors v1 through vn.

Now we will do what we have been doing so far: I write v1 as if it is the first column, v2 as if it is the second column (remember, v itself is a vector, therefore it has n components), and we go on till vn, which also has n components. You concatenate or put them together and you get one large matrix, the eigenvector matrix V = [v1 v2 ... vn]. Notice the notation here: a curly bracket denotes a set; here there is no comma separating the vectors, they are put together side by side as the columns of a matrix.

Similarly, each of these eigenvectors has a corresponding eigenvalue λ1, λ2, ..., λn, and I put them together into one matrix Λ, which is a diagonal matrix, so all off-diagonal elements are 0. Then we can write the factorization of A as a product of 3 matrices: A = V Λ V^(-1). Physically, what this means is that we have sort of rotated into the coordinate system which is defined by all these eigenvectors, and in that system the matrix purely does stretching.


Refer slide time: 20:51

As I said before, all matrices can be thought of as rotating and stretching vectors; eigenvectors are those vectors that are purely stretched. Now, what we know is that real symmetric matrices, and this is where we will use the eigendecomposition, have real eigenvectors and real eigenvalues. This is not necessarily true of all matrices: even if you have a real matrix, if it is not symmetric it might or might not have real eigenvectors and real eigenvalues.

In case you do have a symmetric matrix, there is a nice factorization for it. Remember we had A = V Λ V^(-1); in the case of a real symmetric matrix you can write it as A = Q Λ Q^T, where Q^T is the same as Q^(-1), which means Q is orthogonal (that was our definition of an orthogonal matrix), which also means Q is a rotation matrix. So what does this factorization mean physically? If I have Av and I am trying to determine what the action of the matrix A is on v, and A is a symmetric matrix, we know from here that I can write it as Q Λ Q^T v.

So let us say we have some eigenvalues and eigenvectors. What Q^T v does is rotate v into the directions of the eigenvectors. So you have two actions going on: there is rotation and then there is stretching. What the eigenfactorization does, cleverly, is this: the vector is first rotated into the eigenvector directions, then you stretch it through Λ, and after that you rotate it back, so that the net rotation and the net stretching are put together into one matrix A, which can be written as Q Λ Q^T.

This of course takes a lot of visualization, and I have just summarized it; if time permits, we will give some bonus videos towards the end of this course so that you can visualize it too, and maybe some bonus codes that you can run to see this. One important thing to remember is that the eigendecomposition might not be unique. For example, take the 3 ∗ 3 identity matrix: for the identity matrix, every vector is an eigenvector. Why is that? Because the identity matrix has only one action; it does not even stretch, it basically keeps the vector as it is, which you can think of as a stretch by a factor of 1. I could also make up another matrix, say a scalar multiple of the identity: there is no rotation at all for such a matrix, all it does is stretch, and it will do so for every vector.

Now one can think of a counterpart to this: a rotation matrix is not going to have any stretching at all, which means, really speaking, that it cannot have a real eigendecomposition, because an eigendecomposition tries to find those vectors which purely stretch. So if I have a pure rotation matrix, a pure orthogonal matrix, it is not going to have a real eigendecomposition; you can try this out for yourself.
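Here is a small NumPy sketch (added, not from the lecture) of the eigendecomposition of a real symmetric matrix:

import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])                        # real and symmetric

lam, Q = np.linalg.eigh(A)                        # eigh is meant for symmetric matrices
print(lam)                                        # [1. 3.]: real eigenvalues
print(np.allclose(Q @ np.diag(lam) @ Q.T, A))     # True: A = Q Lambda Q^T
print(np.allclose(Q.T @ Q, np.eye(2)))            # True: Q is orthogonal (a rotation)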

Refer slide time: 25:17

We are now going to look at one very important idea, that of a quadratic form. Remember, when we are trying to find the length of some vector x, the square of the 2-norm is x · x: it is x1² + x2² + ... + xn², which I can write as x · x or x^T x. Now the quadratic form is a slightly weighted version of this; it is written as x^T A x.

Let us look at what this means. Suppose A is an n ∗ n matrix and x is an n ∗ 1 vector; then x^T is 1 ∗ n, which means all put together you get a 1 ∗ 1 number, which is a scalar. So a quadratic form is something that takes a matrix A and a vector x and gives back a scalar, much like the squared length, except it has a factor of A in between.

Now what does this do? If you write out this matrix product, you will basically get combinations of all sorts of terms. For example, if x = (x1, x2)^T and A = [A11 A12; A21 A22], then x^T A x is simply going to be x1 A11 x1 + x1 A12 x2 + x2 A21 x1 + x2 A22 x2. It is a sum of every possible combination, Σ_ij x_i x_j A_ij; this summation is called the quadratic form, and we will see several uses of it as we go on through the course.


Now, one important definition that comes from the quadratic form is that of a positive definite matrix. A positive definite matrix is any matrix whose eigenvalues are all strictly positive. We also have another definition, positive semi-definite, which requires the eigenvalues to be not strictly greater than 0 but greater than or equal to 0. A positive definite matrix has a very nice property concerning quadratic forms: take any non-zero x at all, it does not matter which x you take, x can have positive or negative entries, and x^T A x will always be positive.

A simple example is when the matrix A is the identity. Notice the identity is already a diagonal matrix, I = [1 0 0; 0 1 0; 0 0 1], which means all eigenvalues of the identity are 1, so it is a positive definite matrix since all eigenvalues are positive. This gives us x^T I x, which is the same as x^T x, and this is always positive, as you can see, for all x not equal to 0; if x is 0, the form is of course trivially 0. So a positive definite matrix has the property that for all non-zero x, x^T A x will always be positive.

A positive semi-definite matrix has the property that for all x, x^T A x ≥ 0; so you could have a non-zero x which gives x^T A x = 0. You can see a simple example of this: take the matrix B = [1 0 0; 0 1 0; 0 0 0]. We can find an x for which x^T B x = 0, namely x = (0, 0, 1)^T. This is a non-zero x, but if you write out the product, you can check that x^T B x = 0 even though x is not 0.

Similarly, you can define negative definite and negative semi-definite matrices: a negative definite matrix has all eigenvalues less than 0, and negative semi-definite means all eigenvalues are less than or equal to 0. Correspondingly, the quadratic form x^T A x for a negative definite matrix will always be less than 0 (for non-zero x), and for a negative semi-definite matrix x^T A x will be less than or equal to 0.
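These definitions can be checked numerically; a small added sketch using the matrix B above:

import numpy as np

B = np.diag([1.0, 1.0, 0.0])              # eigenvalues 1, 1, 0: positive semi-definite
print(np.linalg.eigvalsh(B))              # [0. 1. 1.]

x = np.array([0.0, 0.0, 1.0])             # a non-zero x with x^T B x = 0
print(x @ B @ x)                          # 0.0

rng = np.random.default_rng(0)
x = rng.standard_normal(3)                # a random non-zero x
print(x @ B @ x >= 0)                     # True: the form is never negative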
Refer slide time: 31:16

Finally, we come to another decomposition. We will not be using it very often in this course, but it is a very useful idea to mention while we discuss decompositions. It is a generalization of the idea of factorization that we have just used; you can think of it as a generalization of the eigendecomposition itself, but one we can apply to non-square matrices.

We factorized matrices based on stretch and rotation, and the same idea applies to the singular value decomposition also. So let us say A is an m ∗ n matrix, where m is not necessarily equal to n, so it can be a non-square matrix. Then you can write the factorization of A as A = U D V^T, where each of these has the following properties: U is an m ∗ m matrix, V is an n ∗ n matrix, and, obviously, to match the sizes, D must be an m ∗ n matrix.


Now U and V have special properties: remember, both of them are orthogonal, which means both can be interpreted as rotation matrices. D is a diagonal matrix; by diagonal we mean that only the diagonal entries can be non-zero, and in case D is not square, the extra rows or columns are 0 as well, so the off-diagonal entries are zero for sure. Now there is certain terminology here: the columns of U are called the left singular vectors, and they can be calculated as the eigenvectors of A A^T.

Notice one thing about A A^T: in case A is real, A A^T is symmetric, and since it is real and symmetric, we know it has real eigenvalues and real eigenvectors; so the entries of U will always be real for any real A. Similarly, the columns of V are just the switch: they are the eigenvectors of A^T A, and they are called the right singular vectors.

The non-zero elements of D are given as the square roots of the eigenvalues of A^T A; these are called the singular values. The singular value decomposition has a very similar (not the same, but very similar) interpretation to the one I gave for the eigendecomposition: you take a vector, the V^T transformation rotates it, D simply stretches it, and then you rotate it back.
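A small NumPy sketch (added for illustration) of the SVD of a non-square matrix:

import numpy as np

A = np.array([[1.0, 2.0],
              [2.0, 0.0],
              [3.0, 3.0]])                    # a 3 x 2 (non-square) matrix

U, s, Vt = np.linalg.svd(A)                   # s contains the singular values
D = np.zeros_like(A)
D[:len(s), :len(s)] = np.diag(s)              # embed s into an m x n diagonal D

print(np.allclose(U @ D @ Vt, A))             # True: A = U D V^T
eig = np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]
print(np.allclose(s**2, eig))                 # singular values^2 = eigenvalues of A^T A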
So you can think of any matrix A as, again, simply doing a rotation as well as a stretching, and that is the significance of the singular value decomposition, often called the SVD, at least as far as this course is concerned. If time permits, towards the middle or end of the course we will probably provide a few bonus videos with which you can actually visualize the eigendecomposition and the singular value decomposition.

This ends the discussion of linear algebra, at least the separate discussion of linear algebra for this course. In the next series of videos, next week, we will be looking at probability, which is the other half of the mathematics that we require for this course, thank you.

Machine Learning for Engineering and Science Applications
Professor Dr. Balaji Srinivasan
Department of Mechanical Engineering
Indian Institute of Technology, Madras
Introduction to Probability Theory; Discrete and Continuous Random Variables

(Refer Slide Time 00:16)

In this video we will be looking at an introduction to probability theory, and specifically we will introduce the idea of discrete and continuous random variables.

(Refer Slide Time 00:26)

Probability is a mathematical framework for representing uncertainty: wherever we have some uncertain outcome, we tend to use probability as a mathematical representation of the uncertainty in the problem. In engineering systems, there are multiple sources from which such uncertainty arises. Sometimes we have inherent randomness, or more specifically stochasticity, in the system. For example, in quantum mechanics the laws themselves actually lead to some amount of randomness or uncertainty; or, let's say you are dealing with a pack of cards, so you have a little bit of randomness thrown in there. In such cases, whenever you try to predict something, probability theory is going to come in. Now we can look at a slightly higher level of abstraction, where you might have a deterministic system, that is, the laws themselves are not random, unlike quantum mechanics, but you might have incomplete observability: you are not able to see all that is happening in the system. For example, if you have a macroscopic description of, let's say, flow in a room, we know that inherently there are molecules, and within them atoms, etc., but you are not able to observe them.

And typically this leads to a little bit of uncertainty in the properties. We know that macroscopic properties are derived from microscopic properties, but this derivation is actually probabilistic, even though in real life we don't treat the macroscopic properties as if they were probabilistic. There are multiple such examples: whenever we have incomplete observability, once again we can use probability theory.

A third level of abstraction is this: you might not have randomness, you might even have in some sense complete data, plus you might have deterministic laws, but still, despite having full data, you actually have an incomplete model. The model can be incomplete as in, let's say, weather models, where you deliberately do not use all of the data in order to get simple or tractable models. In all these cases, in engineering, you will typically use probability theory.

(Refer Slide Time 03:00)

Now our interest, of course, is that we want to use probability ideas in machine learning, and there are two primary uses. The first is in constructing learning systems themselves; by a learning system I simply mean a machine learning model. So you want to construct a model, and you try to mimic, let's say, human reasoning about uncertainty: we say the probability of rain tomorrow is probably sixty percent.

So inherently, even within our models, there is some probability built in. In order to incorporate such probabilistic thinking, you have probabilistic models. So notice this: you can have probability built right into the model itself. That is one way of using probability. Another idea is that you might actually have a deterministic model; for example, as we will see, many neural network models are almost by design deterministic. So you could have a deterministic model.

That is, how the input relates to the output is actually a deterministic process. Nonetheless, the output itself can be analyzed probabilistically, because the learning system is only correct part of the time; it is not correct all the time. So, for example, you might see a Google image analyzer or any other image analyzer. Typically, the actual output of the algorithm, as we'll see later on in the course, will not be a specific class.

It will not say deterministically that this picture is a cat. What it will say is typically something like: this picture is a cat with probability 0.9. So this would be something like a probabilistic analysis of the learning system: it might go wrong, say, 10 percent of the time, and things like that can be analyzed probabilistically. This is a probabilistic analysis of deterministic or even probabilistic models.

(Refer Slide Time 05:12)

When we come to probabilistic analysis, there are two interpretations of what a particular probability means. Now, all of us know probability lies between 0 and 1, but there are two large schools of thought, and they often get into philosophical fights; we will not get too much into that in this course. But just as a brief introduction, take a statement such as "there is a sixty percent chance of rain tomorrow". This can be interpreted in two distinct ways. One way is what we are usually used to; this is called the frequentist interpretation. A frequentist statement would be something like: the temperature rose this much, the pressure today is so much, it's slightly cloudy, and in all such past cases, when such things happened, it rained sixty percent of the times. So such a statement depends on, let's say, what you have observed so far. That is the frequentist approach to probability.

It says probability is the proportion of events in an infinite sample space, as we will see shortly. It is typically an objective measure. So if I ask what is the probability of a fair die throwing up a two, you will say: if I throw this die, let's say, millions of times, two will come up about one-sixth of the times, so the probability is one-sixth. This is an objective measure. The second interpretation is called the Bayesian approach, preferred typically by economists or even philosophers; this actually measures a degree of belief.

So if somebody says that there is a sixty percent chance of rain tomorrow, what they typically mean is: it looks a little more likely than fifty-fifty; it looks kind of likely that I am going to get rain, a bit more likely than not, but I am not really sure. Something of that sort, a rough estimate: that is a Bayesian statement. Of course there are more technical meanings; it is not as informal as I am making it out to be, but it is a subjective measure. So there is a certain degree of belief incorporated in a Bayesian statement.

Now, for our purposes we do not really strongly care; in a couple of places we will make this distinction, but other than that we do not really strongly care about which approach we are taking, because whatever probabilities result from either interpretation, the mathematics works exactly the same way. For example, if a doctor says to a person: your probability of getting disease one, let's say a heart attack, is 0.1, and your probability of getting disease two, let's say a foot ache, is 0.2; and let's say that these two diseases are independent (this is important in the example I am using); then regardless of which interpretation of probability you choose, it is always true that the probability of getting both disease one and disease two, given that the events are independent, is 0.1 ∗ 0.2, which is 0.02. This is regardless of whether it is a frequentist approach or a Bayesian approach. So the mathematics of probability works exactly the same way regardless of which approach we choose. We will stick to choosing between frequentist and Bayesian depending on what makes sense, and we will only look at the mathematics of the resulting probabilities.
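The doctor's example can even be checked by simulation; a small added sketch (not from the lecture):

import numpy as np

rng = np.random.default_rng(42)
n = 1_000_000

d1 = rng.random(n) < 0.1            # disease one occurs with probability 0.1
d2 = rng.random(n) < 0.2            # disease two occurs independently with 0.2

print(np.mean(d1 & d2))             # close to 0.1 * 0.2 = 0.02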

(Refer Slide Time: 08:33)

So let's come to a few definitions which we'll be using. The first is the definition of a random experiment. The simple definition is: you do an experiment and it results in different outcomes each time, despite your having similar conditions. For example, I toss a coin; it seems to me that I am placing the coin on my thumb in exactly the same way each time, and yet sometimes I get heads and sometimes I get tails. Such an experiment is called a random experiment. Rainfall amounts and the throwing of dice are other examples of this. The second definition is that of a sample space: suppose you do a random experiment, then the set of all possible outcomes of this random experiment is called the sample space. For example,

Tossing of a coin: S = {H, T}

if you toss a coin once, the set of all possible outcomes is that you either get a heads or you get a tails. Now suppose I toss a coin twice; then the sample space is S = {HH, HT, TH, TT}, and one of these four will have occurred when I toss the coin two times.

Now, what is important is that the sample space which we use for determining probabilities depends on the purpose of the analysis, so the same event can be described in many different ways. Let's say we have manufactured a pipe; manufacturing has a certain uncertainty built into it, so you are not always going to get a pipe of the same size each time, and we can call this a random experiment. What I want to describe is the sample space of what kind of pipe I got. Depending on the purpose of my analysis, our sample space S could be

S = ℝ⁺ = {x | x > 0}

that is, the positive half of the real number line: S is some number which is positive. All we are saying is that the diameter could lie anywhere between zero and infinity.

This is one sample space. Another possible sample space is

S = {low, medium, high}

that is, the diameter of the pipe was low, medium, or high. Suppose we are only interested in whether it is one of these three: is it too small, is it kind of okay, or is it too big? If these are our only three qualities of interest in the analysis, our sample space would simply be this. Or we could basically be interested in

S = {satisfactory, unsatisfactory}

is it a satisfactory pipe for my purposes or not? Then my sample space simply has two elements, satisfactory and unsatisfactory.

The point is that you can describe the outcome of the same event in many different ways depending on how you wish to analyze it, and as an engineer, later on when we make machine learning models, this often becomes an important part of the role you play.

(Refer Slide Time 11:46)

Having described the sample space, we come to a very important quantity, a fundamental one in probability theory: the idea of a random variable. It is typically useful to denote the outcome of a random experiment by a number. Notice, for example, that if I toss a coin, my sample space is either heads or tails; heads by itself is not a number, and tails by itself is not a number. But you could assign a number to each: for example, you could assign the number one to heads and zero to tails.

You could even have categorical outcomes: for example, I take an image and ask, is this a picture of a cat, a dog, a horse, or a cow? Then you have four possible outcomes; again, cat, dog, horse, and cow are by themselves not numbers, but you can assign numbers to them, for example zero, one, two, three, or one, two, three, four, etc. So you can assign numerical values even to categorical outcomes. The variable that associates a number with an outcome is called a random variable. So please note that a random variable is a mapping of outcomes to the real numbers or the integers, etc. About notation: this sometimes gets confusing for students, so please remember that the variable itself is denoted by a capital letter, while its value is denoted by a small letter. For example, X would denote a random variable; if I say X = 0.5, then X is the random variable and 0.5 is the value that it takes.

Let's take another example. Suppose we want to find out the rainfall on a particular day. This is a random variable: as you know, we cannot say for sure what the exact amount of rainfall will be. So let's call this random variable R; the actual amount of rainfall would be denoted by r. Suppose I want to make the statement: what is the probability that the rainfall is greater than 10 mm? Let us see how to denote this. Remember, probability is denoted by P, the amount of rainfall as a variable is R, and the actual value it takes, which is 10, is written in lowercase. So we would write the mathematical notation as P(R > 10): what is the probability that the amount of rainfall is greater than 10 mm? Similarly, suppose I ask, what is the probability that a die gives me a 3? If X denotes the random variable which gives the output of the die, you would write P(X = 3), the probability that X is equal to three. Now, a uniform random variable is one for which all outcomes are equally likely. For example, you have an unbiased coin: if you toss it, you either get a heads or a tails, each with probability 0.5. The corresponding probability distribution is called a uniform distribution.

(Refer Slide Time 15:16)

Similarly, for a die you could have a uniform distribution. So let's come to probability distributions. A probability distribution tells us how likely a random variable is to take each of its possible states. Remember, a random variable can take any state in the sample space: if the sample space has 10 members, then the random variable can take any of the 10 values (not all 10 simultaneously). The probability distribution tells us that not all of them might be equally likely; some might be less likely, some more likely. The probability distribution is what tells you how likely each one of these values is.

Depending on what kind of variable we are dealing with, we have two different types of probability distributions. A very common random variable type is the discrete random variable, which has a finite or countably infinite number of possibilities. For example, the number of errors on a particular page, the number of errors I make while speaking, or the number of errors a doctor makes in diagnosis: all these are counts, so the range of the random variable is discrete. For a discrete random variable, the probability is measured by what is called a probability mass function, as we will see in the next slide. We can also have a continuous random variable, which has a real number interval for its range; an example would be any real-valued random variable such as temperature, pressure, voltage, current, etc. In such a case the probability is measured by a probability density function. Please notice the difference: for a discrete variable it is a probability mass function, for a continuous variable it is a probability density function.

(Refer Slide Time 17:17)

So let's come to the probability mass function, denoted PMF; once again, it is defined for a discrete variable. All it is, is a list of the possible values of the random variable along with their probabilities. So let's say you have a biased die; a biased die means that not all six sides are equally likely: one has some probability, two has a different probability, and so on.

So let's say we have these six probabilities:

P(X=1)=0.1, P(X=2)=0.1, P(X=3)=0.2, P(X=4)=0.2, P(X=5)=0.2, P(X=6)=0.2

Notice how I am denoting this: P(X = 1) = 0.1, P(X = 2) = 0.1, and so on. You have to give a probability for each possible outcome in the sample space. For example, if I take a graph and mark the six possibilities, which form the sample space, along the x-axis, then the probabilities 0.1, 0.1, 0.2, 0.2, 0.2, 0.2 plotted above them essentially form the probability mass function.

Some of you might notice that this looks like a set of point loads, which is exactly right: in structures you might have seen something of the sort, a point force applied at a single point.

So that is what the probability mass function is analogous to. In order for a probability mass function to be valid, it has to satisfy certain criteria.

● Domain of P is the set of all possible states of X

All this means is that P should have a valid value for each possible outcome of X. For example, if I do not give, let's say, the last two values and say that P is defined only for one to four, then this is not a valid probability mass function: the whole sample space must be covered.

● 0 ≤ P(X = x) ≤ 1

Next, of course, all of the individual probabilities, since they are probabilities, have to lie between zero and one: they have to be non-negative, and they have to be at most one.

Finally,

● Σ_{x ∈ X} P(X = x) = 1

since one of the outcomes in the sample space must definitely occur, the summation of the individual probabilities has to be one.

Using these rules, you can immediately work out the PMF of a uniform random variable, which is a random variable where all the outcomes are equally likely:

● P(X = x_i) = 1/k, i = 1, ..., k

If there are k outcomes x_i, with i going from 1 to k, then for a uniform random variable each of those probabilities will be equal to 1/k. As I said earlier, this is analogous to a point load.
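The biased-die PMF from this slide can be written down directly; a small added sketch that also checks the validity conditions:

# The biased die from the lecture, as a probability mass function
pmf = {1: 0.1, 2: 0.1, 3: 0.2, 4: 0.2, 5: 0.2, 6: 0.2}

assert all(0 <= p <= 1 for p in pmf.values())      # each probability lies in [0, 1]
assert abs(sum(pmf.values()) - 1.0) < 1e-12        # probabilities sum to one

k = 6
uniform_pmf = {i: 1 / k for i in range(1, k + 1)}  # fair die: P(X = x_i) = 1/k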

(Refer Slide Time 20:53)

So let's now come to continuous random variables. Remember that instead of a mass function you now have a density function (the D stands for density). What it is, effectively, is a probability per unit length. Once again you can make an analogy: instead of a point load, you now have something like a distributed load, since this is a continuous function. We don't have gaps between any two values of the random variable; what we have is a continuous distribution, and instead of giving the probability at a point, what we give is the probability per unit length, in other words the probability density. This is like a distributed load.

So let's say R is the amount of rainfall and I want to find the probability that the rainfall lies between ten and twenty:

P(10 < R < 20) = P(10 ≤ R ≤ 20) = ∫_{10}^{20} p(x) dx

As it turns out, the probability of any one particular point is irrelevant; what you look for is the probability over a range. So say you want to find the probability that the rainfall is between ten and twenty; we simply denote it P(10 ≤ R ≤ 20). The two expressions above are equivalent because the probability of any exact specific value is effectively zero: the area above a single point is effectively zero. So this probability is given by the area under the curve, just as for a distributed load the total load is given by the area under the curve. Even if you don't follow that analogy, you can immediately see that if p(x) is the probability per unit length, then summing up over the interval gives the area ∫_{10}^{20} p(x) dx.

So in general, the probability is P(a ≤ x ≤ b) = ∫_{a}^{b} p(x) dx.

So in order for p to be a valid probability density function:

● The domain, once again just like last time, has to be the set of all possible states of X

● ∀ x ∈ X, 0 ≤ p(x). Note that it is not necessary that p(x) ≤ 1

The probability density has to be non-negative, but notice that it is not necessary for the density itself to be less than or equal to one, because it is a density.

Let me show an example. Say my probability density function is a top-hat function supported between 0 and 0.5. The total area has to be one, therefore the height has to be 2. All I am interested in is making sure that the area of any sub-portion is at most one, because it is the area which is the actual probability; the small p(x), on the other hand, is the density: probability per unit length. I can make it arbitrarily large by simply reducing the length over which the same probability is spread.

Our condition here is, of course, that

● ∫_X p(x) dx = 1

An intuitive way of thinking about the probability density function is to think of it as a normalized histogram. Suppose I take a random number which lies between, let's say, minus five and five, and I draw ten thousand such random numbers; then I draw the histogram of the values. Notice the counts: this value is eight hundred, that value is one hundred; somewhere around zero you get a lot of hits, and everywhere else you get a low number of hits. So we can draw a histogram. Now suppose I normalize this. What I mean by normalization is that instead of looking at the number of times I got a value between, let's say, minus 0.1 and zero, I look at what fraction of the draws fell there. So if I divide all these counts by ten thousand, the eight hundred here becomes 0.08: I had ten thousand draws which went between minus five and five, I divide all the counts by ten thousand, and I look at fractions instead of raw counts. You see that these fractions form a curve of this sort.
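This normalization is easy to reproduce; a small added sketch using a standard normal generator (the particular distribution is an assumption for illustration; the lecture does not specify it):

import numpy as np

rng = np.random.default_rng(0)
samples = rng.standard_normal(10_000)       # 10,000 draws, essentially within (-5, 5)

counts, edges = np.histogram(samples, bins=50)
fractions = counts / counts.sum()           # counts normalized to fractions

width = edges[1] - edges[0]
density = fractions / width                 # probability per unit length: estimates p(x)
print(density.max())                        # close to the normal peak 1/sqrt(2*pi) ~ 0.4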

Now suppose I keep on increasing the number of draws. You might guess that slowly but surely the histogram will start converging to some nice bell curve; as we will see later this week, such a curve will usually be a Gaussian curve. In any case, it will converge to some sort of curve, and that curve is the probability density function: not for a finite number of draws, but in the limit of an infinite number of draws, a normalized histogram tends to the probability density function, thank you.

Machine Learning for Engineering and Science Applications
Professor Dr. Balaji Srinivasan
Department of Mechanical Engineering
Indian Institute of Technology, Madras
Conditional, Joint, Marginal Probabilities; Sum Rule and Product Rule; Bayes' Theorem

In this video we will be continuing our discussion of probability theory. We will look at a few basic ideas beyond what we covered in the last video: conditional, joint, and marginal probabilities, and the two rules which essentially govern all of probability theory, the sum rule and the product rule. Finally, we will look at the definition and a simple derivation of Bayes' Theorem; we will look at Bayes' Theorem in greater detail in the next video after this one.

(Refer Slide Time: 0:48)

So today we are going to look at just the basics.

A quick acknowledgment: several of the ideas and pictures in this lecture have been borrowed from the book by Dr. Christopher Bishop. You might remember that this is one of the references for this course. The book itself is legally and freely available on the web, made available courtesy of Microsoft Research and Dr. Christopher Bishop. I also want to mention that many of the pictures in these slides (not the slides themselves) have been borrowed from Dr. Christopher Bishop with his kind permission.

(Refer Slide Time: 1:27)

So the topics that we are going to look at, as I said before, are joint probability, marginal probability, and conditional probability. All three are simple ideas that arise when you have more than one variable: we were looking at cases with one random variable in the last video, and now we are going to look at more than one. We will also cover the two rules that govern all of probability theory, and finally Bayes' Theorem.

(Refer Slide Time: 1:48)

So here is a simple example; we will be using it throughout this particular video. This is again from Christopher Bishop's book. Imagine that you have two baskets, one red and one blue, and each of these baskets has some fruits. The orange ones you can assume are oranges, and the green ones we will assume are apples. Why green? Just for clarity, because the basket is red. So let us say we have these two baskets, and our task is to randomly put our hand into one of the baskets and pick out a fruit.

Let us say that all fruits within a basket are equally available: even though for clarity we have drawn one fruit at the top and a few others below, assume that all of them are well mixed, so that if you put your hand into a basket you will pick out any one of its fruits with equal probability. The red basket has six oranges and two apples; the blue basket has three apples and one orange. Further, assume that your choice of one basket or the other is not equally probable: let us say you pick the red basket with probability 0.4, that is, 40 percent of the time, and the blue basket 60 percent of the time.

So you are not going to pick a basket with equal probability. Notice this notation, which we looked at in the last video:

P(B = r) = 0.4, P(B = b) = 0.6

Please notice the random variables that we have here:
● B: which basket we pick and
● F: which fruit we pick.

So unlike the cases we looked at in the last video, we are now going to look at a situation with not just one but two different random variables: which basket you pick, and then which fruit you pick within that basket.

(Refer Slide Time: 4:34)

So let us consider this case, once again the same example. If we look at the random variables here, the basket B has the sample space {b, r}: either you pick blue or you pick red. Amongst fruits, you can either pick an orange or an apple:

● B: {b, r}
● F: {o, a}

So now we can ask multiple questions.

For example, you could ask:

● what is the probability of picking an orange?

Now clearly the probability of picking an orange from the red basket is different from the probability of picking an orange from the blue basket. But if you just randomly put your hand into one of the baskets and pick a fruit, what is the probability that it is going to be an orange, given that I pick the red basket with probability 0.4 and the blue basket with probability 0.6? This is one simple question we could ask. You could ask a slightly more complex question like:

● what is the probability that I picked the red basket, given that the fruit I picked was an orange?

This is a classic conditional probability question: you close your eyes, pick up a fruit, and it turns out to be an orange; now you want to know whether you picked it from the red basket or from the blue basket. You can see that, since oranges are more prevalent in the red basket, the red basket might seem a little more likely. So we can ask such questions, and we can ask far more complex questions too. Currently we are only looking at discrete probability examples, but all of these are indicative of the kinds of questions we will ask later on, within the machine learning context.

(Refer Slide Time: 6:57)

So let us come back here and take a case with N = 100 trials. We are going to assume that the number of cases where you pick red turns out to be exactly 0.4 times 100. Strictly, this holds only as N tends to infinity, but we will assume that everything comes out exactly according to the probabilities. So let us make a quick table (shown in the figure). So if I make 100 trials, remember I have two random variables: the basket, where I could have chosen the red basket or the blue basket, and similarly the fruit, which could have been an orange or an apple.

So now out of 100 trials, we want to know.

So this is the case where the basket I picked was red and the fruit I picked was an orange (r, o). This is red and an apple (r, a). This is blue and an orange (b, o), and blue and an apple (b, a).

So let us try and find out how many times each of these cases occurs.

So I know that in the case of 100 trials the basket will be red a total of 40 times. Similarly, the basket will be blue a total of 60 times. Now, of the 40 times that I pick the red basket, suppose I want to know in how many of the cases the fruit will be an orange. Assuming everything works out exactly according to the probabilities, six-eighths of those cases, which is 30 cases, will give you an orange.

Two-eighths of the cases, that is ten cases, give you an apple. So a red basket with an apple occurs ten times. Now let us do the same here: 60 of the cases are the blue basket, and within that an orange comes up one-fourth of 60 times, which turns out to be 15. And we know that in the remaining 45 cases we must actually be picking an apple. So this is a table which tells you how many times each of these cases occurs. You can also add this up. When I say 100 trials, what does it mean? In each trial you pick a basket and choose a fruit. Amongst those, 45 of the times we actually picked an orange and 55 times we actually picked an apple. So we can put this table together in this way, and I will be repeating this table in future slides. You can see basket is red, basket is blue, and I have written out this table which tells how many times each of these cases occurs; remember each entry actually indicates an intersection of two events, that is, both occur together.

(Refer Slide Time: 10:38)

So such distributions are called joint distributions. Right now I have written the counts. The probability of each of these cases is obviously going to be the count divided by 100, so you can write 0.3, 0.15, etc., if you want the probabilities.
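To make this concrete, here is a minimal Python sketch (my own illustration, not from the lecture; the variable names are assumptions) that stores the count table from the slide and converts it to joint probabilities:

# Counts from the lecture's table: keys are (basket, fruit) pairs.
# Out of N = 100 trials: (r,o)=30, (r,a)=10, (b,o)=15, (b,a)=45.
counts = {
    ("r", "o"): 30, ("r", "a"): 10,
    ("b", "o"): 15, ("b", "a"): 45,
}
N = sum(counts.values())  # total number of trials, 100

# Joint probability P(B = basket, F = fruit) = n_ij / N
joint = {event: n / N for event, n in counts.items()}
print(joint[("r", "o")])  # 0.3
print(joint[("b", "a")])  # 0.45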

Now we can generalize this to two variables. Let us say you have a variable X and a variable Y
just like in this case we had B and F, basket and fruit.

X : x_i, i = 1, ⋯, m

Y : y_j, j = 1, ⋯, n

You can have two general variables X and Y. In the example case, the basket B, the random
variable B had only two choices, it was either red or it was blue. But you can imagine a case
where you have many more possibilities. So you have x_i, and let us say i goes from 1 to some value m; X, instead of two possibilities, has m possibilities. And Y has some n possibilities. In this case (on the slide) we have chosen something like m = 5 and n = 3, but you can obviously choose different numbers.

So instead of the 2 × 2 table which I have shown here, your table will be an m × n table of how many times each case occurs. You take a large number of trials N; really speaking, you will get the right fractions only as N → ∞, unlike the pseudo-example I took last time, or this one, where I have taken 100 and assumed that everything works out. Typically you have to take a very large N in order for the trials to work out exactly according to their probabilities.

So if we take that, we can define something called the joint probability.

Joint probability is the probability that X will take a desired value x_i and Y will take some desired value y_j. For example, in our case I could ask for P(B = r, F = o), the probability that the basket is red and the fruit is orange. That would be an example of a joint probability.

So you write it with the notation

P(X = x_i, Y = y_j)

So if we want that, how would we do it? Let us say I want this:

P(B = r, F = o) = 30/100 = 0.3

I would say this is 30 divided by the total number of trials, which was 100, so this is 0.3.

In the general case, let us assume that this box here, (x_i, y_j), has n_ij entries; you can think of it as a matrix with entries n_ij. Then the probability is

P(X = x_i, Y = y_j) = n_ij / N

where N is the total number of trials. Similarly you can ask, what is

P(B = b, F = a) = 45/100 = 0.45

(Refer Slide Time: 14:03)

So now, using this, we come to an important rule called the sum rule. The sum rule asks the following question: if I do not want a joint probability but simply want to ask, what is the probability that the basket is red, P(B = r)? Or, what is the probability that the fruit is an orange, P(F = o)? Now you can see this immediately. The fruit is an orange in the 30 cases where the fruit was an orange and the basket was red, and in the 15 cases where the fruit was an orange and the basket was blue, which means a total of 45 cases. So the probability that the fruit is an orange is going to be 45 divided by 100.

P(F = o) = 45/100 = 0.45

So here c_i is basically the sum of this column, which is what we got here, regardless of what value Y takes. We take a summation over all the possible values of Y, in this case over all the possible baskets that the fruit could have come out of, and the total is called c_i. The total number of trials obviously remains the same. This quantity is called the marginal probability; I will come to the reason for that name shortly.

Then, P(X = x_i) = c_i / N : Marginal probability

We can automatically see that

c_i = Σ_j n_ij, j = 1, ⋯, n

So therefore we can write

⇒ P(X = x_i) = Σ_j n_ij / N

and since

P(X = x_i, Y = y_j) = n_ij / N : Joint probability

⇒ P(X = x_i) = Σ_j P(X = x_i, Y = y_j) : Sum rule of probability

So this is called the sum rule of probability. It is a very important rule. We will be using this
multiple times.

This probability P(X = x_i) here is the marginal probability, and the summation is sometimes called marginalization. So we say the marginal probability P(X) is the marginalization of the joint probability. The terminology might look a little confusing, but the idea is very simple, just as shown here. Now why is it called the marginal probability? If you notice, these total numbers are written in the margin of the table. That is the historical origin. Marginalization does not mean anything else; it simply means that the columns or the rows have been summed up and the totals put in the margin, which is why the total probability is called the marginal probability.
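As a quick illustration (again my own sketch, not the lecture's), the sum rule is just a row or column sum over the joint table:

# Joint probabilities from the basket-and-fruit table.
joint = {("r", "o"): 0.30, ("r", "a"): 0.10,
         ("b", "o"): 0.15, ("b", "a"): 0.45}

# Sum rule: P(F = o) = sum over all baskets of P(B, F = o).
p_f_orange = sum(p for (b, f), p in joint.items() if f == "o")
print(p_f_orange)  # 0.45  (= 0.30 + 0.15, the column total in the margin)

# Likewise the marginal P(B = r) is a row sum:
p_b_red = sum(p for (b, f), p in joint.items() if b == "r")
print(p_b_red)  # 0.4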

(Refer Slide Time: 17:13)

The next idea which you will most probably be familiar with already is the idea of conditional
probability. So let us say we ask a question, instead of just asking what is the probability that the
basket is red and the fruit is orange, you can ask a question similar to the one that I asked in the
beginning of this video, which is

● What is the probability that the basket I picked is red given that the fruit was orange?

So please understand: sometimes people get confused between conditional probability and joint probability. We will see shortly, in the next slide, that the two are slightly different.

There is one question, which is: what is the probability that the basket I picked was red and the fruit was orange? The second question has some extra information given: you finally see that the fruit you picked is an orange, and now you want to know which basket it came out of. Did it come out of the red basket or out of the blue basket? Just to orient you, we will be using this idea a little later when we come to interpreting images. There I can ask a question like: what is the probability that this is a dog, given the values of the pixels? That is the way we will be utilizing the idea of conditional probability later on.

What is the probability of a given output given a certain input, P of Y given X?

So this is the way we write it.

P(Y = y_j | X = x_i)

This is read as "probability of Y given X"; that is the way we usually say it orally, and you should be familiar with this kind of language already. So this is P of Y equal to y_j, given that X is equal to x_i. In our case this would be written as P(B = r | F = o). Now let us answer this intuitively. We are given that the fruit is an orange, and out of the trials we know that 45 of those cases are oranges. Out of those, in only 30 cases was the basket red, which means our probability is going to be

P(B = r | F = o) = n_ij / c_i = 30/45

More generally we would write it as

P(Y = y_j | X = x_i) = n_ij / c_i : Conditional probability

where c_i is the total, or the marginal sum, of that particular column.

(Refer Slide Time: 20:28)

Now we extend this idea to the product rule. Remember that

● the conditional probability was P(Y = y_j | X = x_i) = n_ij / c_i, and
● the joint probability was P(X = x_i, Y = y_j) = n_ij / N.

You can see that the numerator is the same, because we are interested in the same case, where the red basket and the orange occur together. But the denominator here, when I am talking about the joint probability of both these occurring, is the total number of trials, whereas what I know in the conditional case is a little stronger: I know that the fruit is already an orange.

In the joint case I do not know what the fruit actually is, which is why I divide by the total N. So now if we see this, we can therefore write,

n_ij / N = (n_ij / c_i) × (c_i / N), that is, Joint = Conditional × Marginal

⇒ P(X = x_i, Y = y_j) = P(Y = y_j | X = x_i) P(X = x_i) : Product rule of probability

So this rule is called the product rule. Notice that the notation has been chosen very carefully: you can almost read the conditional bar as a division sign, so that P(X, Y) looks like "P of this by this" multiplied by "P of this". That is just for memory's sake; otherwise this rule is called the product rule of probability.
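As a small check (a sketch of my own, continuing the same toy table), conditional probability and the product rule can be computed directly from the joint probabilities:

# Joint probabilities from the basket-and-fruit table.
joint = {("r", "o"): 0.30, ("r", "a"): 0.10,
         ("b", "o"): 0.15, ("b", "a"): 0.45}

def marginal_F(fruit):
    # Sum rule: P(F = fruit) = sum over baskets of P(B, F = fruit).
    return sum(p for (b, f), p in joint.items() if f == fruit)

# Conditional probability P(B = r | F = o) = P(B = r, F = o) / P(F = o).
p_r_given_o = joint[("r", "o")] / marginal_F("o")
print(p_r_given_o)  # 0.666... = 30/45

# Product rule check: P(B = r, F = o) == P(B = r | F = o) * P(F = o).
print(abs(joint[("r", "o")] - p_r_given_o * marginal_F("o")) < 1e-12)  # True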

(Refer Slide Time: 22:37)


P(X = x_i) = Σ_j P(X = x_i, Y = y_j) : Sum rule of probability

P(X = x_i, Y = y_j) = P(Y = y_j | X = x_i) P(X = x_i) : Product rule of probability
And these two put together summarize the two important rules of probability which we will be
using again and again and again. Practically every theorem in probability can be derived from
these two rules, at least the theorems in probability that we will be using.

So we simplify the notation.

Please notice these x_i and y_j: you do require them for rigor, but we will not carry them around; we will use a simplified notation instead. We will simply write P(X) and drop the "= x_i" and "= y_j"; you have to understand what it means according to the context. This way it is a little easier to see: P(X) is the sum over Y of P(X, Y), and P(X, Y) is P(Y | X) multiplied by P(X), which is the product rule.

Simplified Notation:


Sum rule: P(X) = Σ_Y P(X, Y)

Product rule: P(X, Y) = P(Y | X) P(X)

(Refer Slide Time: 23:33)

Now we come to Bayes' Theorem, which is a simple consequence of the rules that we have seen so far, particularly the product rule; it falls out naturally. You may well be familiar with Bayes' Theorem from earlier. We will be using it several times, extensively, throughout this course. I will just show a simple derivation right now, and we will look at examples in the next video. So let us start with the product rule,

P(X, Y) = P(Y | X) P(X)

By switching the variables,

P(Y, X) = P(X | Y) P(Y)

and since P(X, Y) = P(Y, X), equating the right-hand sides gives

P(Y | X) P(X) = P(X | Y) P(Y)

⇒ P(Y | X) = P(X | Y) P(Y) / P(X) : Bayes' Theorem

So this is called Bayes' Theorem. It is an extremely useful theorem, and it sometimes gives non-intuitive, if not outright counterintuitive, results, as we will see in the next video. There we will work through both the basket problem and another problem, just in order to see how Bayes' Theorem can be used.

Machine Learning for Engineering and Science Applications
Professor Dr. Balaji Srinivasan
Department of Mechanical Engineering
Indian Institute of Technology, Madras
Bayes' Theorem - Simple Examples

(Refer Slide Time: 0:21)

In the last video we saw a quick derivation of Bayes' Theorem. In this video we will be looking at a couple of very simple examples, both discrete, although later in the course we will be using certain continuous distributions with Bayes' Theorem. These are very elementary examples, so in case you are already comfortable with Bayes' Theorem, this is just meant as a review and you can easily skip this video.

(Refer Slide Time: 0:38)

So let us return to the old problem that we were looking at: there were two baskets, one red and one blue, containing oranges and apples; 6 oranges and 2 apples in the red basket, and 3 apples and 1 orange in the blue basket. We randomly pick a fruit out of one of these baskets. What we know is that the blue basket is picked slightly more often: we pick the blue basket with probability 0.6 and the red basket with probability 0.4. So this was the background of the problem.

And let us go a little further than what we did in the last video and ask a few questions. The questions illustrated here are

● Write down all the conditional probabilities P(F | B).

That is, all the possible conditional probabilities for the random variable F representing the fruit and the random variable B representing the basket. Then we have a couple of more questions.

● If you pick a fruit at random, what is the probability that it came out of the blue basket?

So let us say you close your eyes and dip your hand in, without knowing which basket you are putting your hand into, and you pick a fruit at random. What is the probability that it came out of the blue basket? The second one is essentially a variation of the previous question, which is

● If you pick a fruit at random and it turns out to be an orange, what is the probability that it came out of the blue basket?

(Refer Slide Time: 2:14)

Let us start with the first. Remember that we had drawn the joint probability table. Actually we had drawn the joint distribution counts, but now I have turned it into a probability table: if you recall from last time, we had a total of 100 trials, and if you divide each of those counts by 100 you get the probabilities. Remember what these represent:

P(B = r, F = o) = 0.3, P(B = r, F = a) = 0.1, P(B = b, F = o) = 0.15, P(B = b, F = a) = 0.45

So let us address the first question. You want to write down all the conditional probabilities,
P(F | B).

As you can see, there are four different possibilities: two possibilities for the fruit and two for the basket. So you could have an orange coming out of the red basket, an apple coming out of the blue basket, and so on.

So how do we find this out?

We have the elementary expression for conditional probability which we looked at in the last video; it is actually just a rewriting of the product rule:

P(F | B) = P(F, B) / P(B) : Product rule

So using that, you can now take a specific case: the probability that the fruit is an orange given that the basket is red, P(F = o | B = r). What you are told is that you have chosen the red basket, and now you want to find the probability that the fruit is an orange:

⇒ P(F = o | B = r) = P(F = o, B = r) / P(B = r) = 0.3/0.4 = 0.75

This is the detailed way of calculating it. We could also have calculated it more simply, by inspection: given that the basket is red, you already know that there are 6 oranges and 2 apples, so the probability is 6/8 = 0.75. That is the second way of doing it. Similarly you can now find all the other cases:

P(F = a | B = r) = P(F = a, B = r) / P(B = r) = 0.25

P(F = o | B = b) = 0.25

P(F = a | B = b) = 0.75

The probability that the fruit is an apple given that the basket is red can again be done by inspection: it is 0.25. In case you are not yet comfortable with it, I would recommend doing it by the product rule as well. Just to guide you through it: this is the probability that the fruit is an apple and the basket is red, divided by the probability that the basket is red, which comes to 0.1 divided by 0.4, that is 1/4. Similarly, for fruit is an orange given basket is blue, this is 0.15 divided by 0.6, which is also 0.25; again you can do it by inspection too. And the probability that the fruit is an apple given that the basket is blue is 0.45 divided by 0.6, that is 0.75, which you can also see as 3/4.

(Refer Slide Time: 5:49)

P(F = o | B = r) = 0.75, P(F = a | B = r) = 0.25

P(F = o | B = b) = 0.25, P(F = a | B = b) = 0.75

Now it is worth noting that you can also obtain the marginal P(F = o) using the sum rule. Recall that by the sum rule, the marginal probability that the fruit is an orange can be written as

P(F = o) = Σ_B P(F = o | B) P(B)

In this case the orange could have come out of one of two places: either the red basket or the blue basket.

You can now write this in detail:

⇒ P(F = o) = P(F = o | B = r) P(B = r) + P(F = o | B = b) P(B = b)

⇒ 0.75 × 0.4 + 0.25 × 0.6 = 0.45

So if you sum this up, it again retrieves the same number, 0.45. Why do this? Because I just wanted to show that you can recreate all these quantities from joint probabilities or from conditional probabilities. You can similarly calculate that

P(F = a) = 0.55 (Exercise)

Once again if you are not quite comfortable with this calculation, I would recommend that you
do it in detail.

(Refer Slide Time: 7:15)

So let us move on to the second question.

If you pick a fruit at random, you just dip your hand in and pick, you have no information about the fruit, but you want to know the probability that it came out of the blue basket. Now this is a more or less trivial question: you know that the blue basket is picked with probability 0.6, so the probability is indeed 0.6.

There are a few things that I would like to point out here.

● The first thing is that if you do not know the identity of the fruit, which is basically this example, the process of picking simply gives you this number, the probability that the basket is blue, which is 0.6. This probability is called the prior probability. Please note this term. "Prior" here refers to the fact that you are assigning a probability to which basket the fruit came out of before knowing the identity of the fruit.

So prior to knowing the identity of the fruit, you are identifying the basket out of which it came, and that probability is quite simple; in fact, it was given to you before, and it is simply 0.6. It might seem like I am saying too much about something that needs no explanation, but you will see that this is useful as we move forward.

(Refer Slide Time: 8:39)

So once again, you could have obtained this probability P(B = b) simply by using the joint probability table.

(Refer Slide Time: 8:50)

Now let us move onto the next part of the same question which is,

● If you pick a fruit at random, say you close your eyes, pick a fruit, look at it, and now you know that it is an orange, you have some extra information compared to the previous case, case 2, where you had not yet looked at the identity of the fruit. Given that it is actually an orange, what is the probability that it came out of the blue basket?

The essential thing here is that the probability changes, because you now have more information.
To give you an example: suppose I ask you, if you are walking on the road, what is the probability of meeting an Indian? If I pose this question about a random person anywhere in the world, and Indians are about one-fifth of the population, then without knowing anything about where you are walking, your probability of meeting a person of Indian origin is about 0.2, about one-fifth.

However, if I tell you that I am walking on a road in India, I have now given you extra information, and the probability of meeting a person of Indian origin becomes very high, because the majority of the population of India is of Indian origin. Similarly, when you pick the fruit: if you do not know the identity of the fruit, the probability of it coming from the blue basket is simply 0.6. This is the prior probability.

Now we are asking for what is known as the posterior probability: knowing the identity of the fruit, and knowing that the fruits are distributed differently in the two baskets, what is the probability that it came out of the blue basket? Mathematically, we are asking for the probability that the basket is blue given that the fruit is an orange, P(B = b | F = o). This is a classic case for Bayes' Theorem.

So remember, Bayes' Theorem is

P(Y | X) = P(X | Y) P(Y) / P(X)

where P(X, Y) denotes the joint probability of X and Y.

Okay. So we will use this. Here Y is "basket is blue" and X is "fruit is an orange". If you write the expression out and evaluate it,

P(B = b | F = o) = P(F = o | B = b) P(B = b) / P(F = o) = (0.25 × 0.6) / 0.45 = 1/3

There are several ways of calculating it, and if you find another way more intuitive, please do it that way, just to reassure yourself that the calculation is correct. Most importantly, this is a very elementary example of the use of Bayes' Theorem. One direction is easy: the probability that the fruit is an orange given that the basket is blue is clearly one-fourth.

The other direction is a little harder, and that is the direction in which we usually use Bayes' Theorem. The difference between question 2 and question 3 is simply this: not knowing the identity of the fruit, the prior probability was 0.6; knowing the identity of the fruit actually reduced the probability to one-third. So you can see the posterior probability as a modification of the prior probability, and we will use this viewpoint as we move on in the course. Before you knew anything, the probability was 0.6; after you observed the result of the process, your probability got modified.
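Here is a minimal Python sketch of this calculation (my own illustration; the variable names are assumptions, not the lecture's):

# Given quantities from the problem statement.
p_b = {"r": 0.4, "b": 0.6}               # prior P(B)
p_o_given_b = {"r": 6 / 8, "b": 1 / 4}   # likelihood P(F = o | B)

# Evidence via the sum rule: P(F = o) = sum over B of P(F = o | B) P(B).
p_o = sum(p_o_given_b[b] * p_b[b] for b in p_b)   # 0.45

# Bayes' theorem: posterior P(B = b | F = o).
posterior_blue = p_o_given_b["b"] * p_b["b"] / p_o
print(posterior_blue)  # 0.333..., i.e. the prior 0.6 drops to 1/3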

(Refer Slide Time: 13:17)

So here are some terminologies that we will be using for the rest of the course. Just to repeat what Bayes' Theorem is:

P(Y | X) = P(X | Y) P(Y) / P(X)

Now notice that usually, as we will see later in the course, maybe around the fourth or fifth week, in many cases we are more interested in the numerator:

P(Y | X) ∝ P(X | Y) P(Y)

where ∝ stands for "proportional to", as you might know. So the posterior probability is proportional to P(X | Y) multiplied by P(Y).


Now each of these terms has a name. P(Y | X), as I said earlier, is called the posterior probability. P(Y), your probability before you knew anything about the state of the particular X you are looking at, is called the prior probability. P(X | Y) is called the likelihood. So:

⇒ Posterior ∝ Likelihood × Prior

This is standard terminology that we will be using: likelihood multiplied by prior is proportional to the posterior probability.

(Refer Slide Time: 14:25)

So let us take another example, again a very simple one. If you have not seen this before, the results can be a little non-intuitive. Let us say a person goes to a cancer diagnosis centre for a test and the test turns out to be positive; here "positive" means that the test says the person has cancer. Obviously this person is going to be worried. Now, what are the mathematical, or at least numerical, questions that the person can and must ask in order to determine the accuracy of the diagnosis?

And given that the test turned out positive, what are the chances that the person actually has
cancer? Obviously the numbers here depend on the answers to the previous question. So let us
look at this.

(Refer Slide Time: 15:10)

One question that anybody would ask is "how accurate is the test?", but this is kind of a vague question; you need a little more specificity in order to answer it. So you ask a more specific question. Let us say the person asks

● What percentage of people with cancer test positive?

This is an intelligent question to ask, and the testing agency tells the person that 99 percent of the people who have cancer do test positive.

But this is not sufficient, because a test that always declares cancer would also make everyone with cancer test positive. These are the true positives, people who have cancer and test positive. So you also need to ask the flip side of this question, which is

● What percentage of people without cancer test negative?

Now these two pieces of information do not follow from each other, so please notice this and think about it.

It is not obvious how many people without cancer will test negative. Like I said, your machine could be broken and always say "cancer", in which case all the people without cancer would still test positive, even though 99 percent of the people with cancer are correctly detected. The answer given is that 99 percent of the people without cancer also test negative. Now this could be very worrying to the person, because if you just go by intuition, it seems like there is a 99 percent chance of having cancer if you test positive.

But that is not true, because you have to ask one more question, which is the non-intuitive part here:

● What percentage of the population actually has cancer?

This might not seem relevant, but as you will see shortly when we come to Bayes' Theorem, it is a very important quantity. Just as an estimate, let us say that about half a percent of the population actually has cancer.

So let us go ahead and try to calculate, via Bayes' Theorem, what the chances are of having cancer. Before that we should ask:

● What are the actual random variables in this problem?

Remember that in the basket example the random variables were which basket I picked and which fruit I picked.

(Refer Slide Time: 17:58)

So similarly in the cancer diagnosis question the random variables are,

does this person actually have cancer or not?

● State of disease, D: {C, NC}

This random variable D captures whether the person has cancer or not: C stands for cancer, NC stands for no cancer.

The second random variable is the result of the test.

● Result of the test, T: {+, −}

The result of the test could either be positive, or it could be negative. So you have two random
variables, the state of the disease and the actual result of the test.

So what are we given about the test?

We are given that the probability that you test positive given that you have cancer is P(+ | C) = 0.99. This is the first number you have been given. The second is that the probability that you test negative given that you do not have cancer is P(− | NC) = 0.99. You can, of course, find the complements of these two: the probability that you test negative given that you have cancer is P(− | C) = 0.01, and the probability that you test positive given that you do not have cancer is P(+ | NC) = 0.01. The third thing we asked for is P(C) = 0.005, the probability that a random person has cancer. Let us go back to the example of finding a person of Indian origin while walking on the street: the probability of meeting an Indian, given no context at all about where you are walking, is the prior probability, about one-fifth.

Similarly, without telling you the basket from which the fruit came, if I just said I picked a fruit, the probability that it came out of the blue basket was the prior probability, 0.6. In this case, without knowing the test result, the prior probability of a random person having cancer is 0.5 percent, which is 0.005. The question we are actually interested in is the flip side of this: we know the probability of testing positive given that you have cancer; what we want is the reverse, the probability of having cancer given that I tested positive, or P(C | +).

(Refer Slide Time: 20:41)

So let us use Bayes' Theorem as before:

P(C | +) = P(+ | C) P(C) / P(+)

The denominator can be opened out using the sum rule of probability, which we have seen a couple of times:

⇒ P(+) = Σ_D P(+ | D) P(D)

where the sum runs over all the possible states of the disease: you either have cancer or you do not. So if we open it out,

⇒ P(+) = P(+ | C) P(C) + P(+ | NC) P(NC)

Let us write these numbers out. P(+ | C) we know to be 0.99. P(C), the probability that a random person has cancer, is the prior, which was 0.005. And P(+ | NC), the probability of testing positive without having cancer, was, as we saw in the previous slide, 0.01. The probability of not having cancer is, of course, 1 minus the probability of having cancer, which comes to 0.995. If you calculate these numbers,

P(C | +) = 0.00495 / (0.00495 + 0.00995) ≈ 0.33

You will notice something: the numerator is 0.00495, the first number in the denominator is 0.00495, the second is 0.00995, and surprisingly enough the probability of cancer given that you tested positive is just 33 percent. Even though the test was seemingly 99 percent accurate, we got a probability of only 33 percent that you have cancer given that you tested positive. This is remarkable. And where does this come from? If you look at the numbers, you can see that it comes basically from the low prior. What do I mean by this?
basically it comes due to the low prior. What do I mean by this?

(Refer Slide Time: 23:13)

So let us see this with numbers instead of probabilities, using a numerical, frequentist approach:

P(C | +) = (No. of people with cancer who test positive) / (No. of people testing positive)

The number of people testing positive includes both cancer and non-cancer cases: some people who test positive would have had cancer, and some would not. We want the fraction of those testing positive who actually have cancer. So let us consider a population of 10,000 people who go for the test, a random selection of the population. Out of these,

● the number of people with cancer is going to be 0.5 percent, which is only 50 people.

So remember, the total population is 10,000 and the number with cancer is 50. Out of these people with cancer, since the test is extremely accurate, 99 percent of 50 gives you approximately 50; essentially all 50 of them will test positive.

So "cancer and positive" is going to be 50; the test is nearly 100 percent accurate here.

Now let us look at the other case. Out of the 10,000 people, 9,950 do not have cancer. They also go for the test, and out of them only 1 percent, not a large percentage, test positive. But since this group is large, that comes to about 100 people testing positive. So please notice this: only 50 people with cancer are testing positive, while 100 people without cancer are testing positive.

So if somebody gives you the information that you tested positive, you can now see that it is not very significant information, because 100 people without cancer tested positive: twice as many people without cancer tested positive as people with cancer. This is basically why the number P(C | +) comes out as one-third: the false positives are twice the true positives. This tells us several things. One is that the posterior came out this low because the prior was so small.

If 50 percent of the population had cancer, both these numbers would balance out and you would get very reasonable numbers here. The other thing you notice is that what really hurts the test is false positives. The true positives are very good: all the cases with cancer test positive. But this data is being contaminated by all the false positives, the people without cancer who still test positive. So if you want to improve the test, you need to increase P(− | NC); if that number is much higher, the number of false positives becomes much lower. So this is another example of Bayes' Theorem.
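The counting argument can also be sketched in a few lines of Python (again my own illustration; note the lecture rounds 49.5 to 50 and 99.5 to 100):

population = 10_000
with_cancer = int(0.005 * population)       # 50 people
without_cancer = population - with_cancer   # 9950 people

true_positives = 0.99 * with_cancer         # ~50: nearly all cancer cases test positive
false_positives = 0.01 * without_cancer     # ~100: 1% of a large group is still large

# Fraction of positive tests that are actual cancer cases.
print(true_positives / (true_positives + false_positives))  # ~0.33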

In future videos we will be looking at more direct applications of Bayes' Theorem with continuous distributions, etcetera. This was just meant as a simple review of how to use Bayes' Theorem. The important takeaway here is the importance of the prior. As we will see later on, in many cases priors are assigned arbitrarily, but they can actually be a nice knob to turn in order to get the kind of results that you want. We will see this from the fifth week of the course onwards. Thank you.

Machine Learning for Engineering and Science Applications
Professor Dr. Balaji Srinivasan
Department of Mechanical Engineering
Indian Institute of Technology Madras
Independence Conditional Independence Chain Rule Of Probability

(Refer Slide Time: 00:13)

In this video we will be looking at some further ideas on probability. Specifically, we will be looking at three simple ideas: that of independence, which you will already be familiar with from school probability; conditional independence, which you might or might not be familiar with; and finally the chain rule of probability.

(Refer Slide Time: 00:37)

So, independence as a notion: all of us know that two random variables being independent means that when an event X happens, it has no bearing on whether event Y happens or not.

A simple example: suppose I toss a coin and it gives heads or tails. This has no bearing on the outcome of tossing another coin.

Such random variables are called independent random variables, and the mathematical condition

(Refer Slide Time: 01:06)

under which we say two variables x and y are statistically independent is

p(x, y) = p(x) p(y)

To be a little more precise,

(Refer Slide Time: 01:18)

we have to use slightly more formal notation:

p(X = x, Y = y) = p(X = x) p(Y = y) ∀ x ∈ X, y ∈ Y
That is, you should say that this holds for every possible value x belonging to event X and every possible value y belonging to event Y.
(Refer Slide Time: 01:33)

For example, if I am tossing one coin in one hand and another coin in the other hand, the event X would be the outcome of the left-hand coin, either heads or tails, and similarly the event Y would be the outcome of the right-hand coin.

Now for each of the combined outcomes {HH, HT, TH, TT}, it should work out that p(x, y), the joint probability of x and y, is the product of the individual, or marginal, probabilities p(X = x) p(Y = y).

(Refer Slide Time: 02:12)

Now take a simple example:
● X is the outcome of the throw of a die, and
● Y is the outcome of the toss of a coin.

So, of course,
● X can take 6 possible values, 1, 2, 3, 4, 5, 6, and
● Y has two possible values, heads and tails.

Now each of these combinations, 1 and heads, 2 and heads, 3 and heads, etc., the 12 individual joint probabilities, all of them should obey this law, Ok.

So all those joint probabilities should obey this law and we know that they will because
notionally, physically we have an idea obviously that the event X and event Y are
independent.

This definition also gives you a good way to find out whether two events are independent or not. Suppose, like we saw in the previous video, we have a joint probability table, where say X takes values x_1, x_2, x_3 and Y takes values y_1 and y_2.

(Refer Slide Time: 03:07)

So we have all these joint events (x_1, y_1), (x_1, y_2), (x_2, y_1), (x_2, y_2), (x_3, y_1) and (x_3, y_2),

(Refer Slide Time: 03:17)

so if you draw this, and suppose all you have are these joint probability values, then you can find the marginal probabilities of X and Y by simply adding them up, and then check whether p(x, y) = p(x) p(y). That gives you a check for whether the two events are independent or not. Remember that the condition given is an "if and only if" condition, Ok.

So if two events are independent then you will get p(x, y) = p(x) p(y), and only if this is true do we consider the two events to be independent.
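Here is a small sketch of that check in Python (my own illustration under the stated definition):

def is_independent(joint, tol=1e-9):
    # joint: dict mapping (x, y) pairs to probabilities.
    xs = {x for x, y in joint}
    ys = {y for x, y in joint}
    px = {x: sum(joint[(x, y)] for y in ys) for x in xs}  # marginals of X
    py = {y: sum(joint[(x, y)] for x in xs) for y in ys}  # marginals of Y
    # Independence iff p(x, y) = p(x) p(y) for every pair.
    return all(abs(joint[(x, y)] - px[x] * py[y]) < tol
               for x in xs for y in ys)

# A fair die and a fair coin: every joint probability is 1/12, so independent.
fair = {(x, y): 1 / 12 for x in range(1, 7) for y in "HT"}
print(is_independent(fair))  # True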

An example of two random variables which are not independent is height and weight. Even though you could have a very tall person who is very slim, and a short person who is quite stout, in general as height increases weight will increase. So the random variables x and y are actually correlated, not independent, random variables.

(Refer Slide Time: 04:29)

This definition of independence is equivalent to a conditional probability statement also.

For example, if two events X and Y are independent then I can say that p(y | x) = p(y), or p(x | y) = p(x). What does it mean? Physically it means that whether x happens or not has no bearing on whether y happens, and therefore p(y | x) = p(y). Equivalently, you could also say that p(x | y) = p(x). You can prove the equivalence of these two statements very simply:

(Refer Slide Time: 05:02)

Remember that p(x, y), by the product rule, is p(y | x) p(x).

(Refer Slide Time: 05:14)

Now suppose we take the definition of independence, p(x, y) = p(x) p(y), and equate the two; you immediately get that p(y | x) = p(y).

(Refer Slide Time: 05:31)

So you can derive one statement from the other, and similarly, going in the reverse direction, you can derive the other one. The two statements are actually equivalent.

(Refer Slide Time: 05:42)

Typically, we denote independence by the notation x ⊥ y, where ⊥ is a perpendicular sign, because we think of independent variables by analogy with orthogonal variables, Ok.

So please remember this notation.

(Refer Slide Time: 05:58)

Now we continue on and look at a slightly more involved notion, that of conditional independence. This is not simple independence; it is conditional independence. The definition is kind of obvious: it is a simple extension of the previous definition. Two random variables X and Y are said to be independent given Z, a third variable or event, if and only if

p(x, y | z) = p(x | z) p(y | z)

(Refer Slide Time: 06:39)

Let us give a more precise definition like we did last time.

(Refer Slide Time: 06:43)

It is simply an extension: instead of the shorthand p(x, y | z) = p(x | z) p(y | z), we write

p(X = x, Y = y | Z = z) = p(X = x | Z = z) p(Y = y | Z = z) ∀ x ∈ X, y ∈ Y, z ∈ Z

(Refer Slide Time: 07:04)

So let us take a few examples in order to clarify this notion of conditional independence. So
here is the first example.

(Refer Slide Time: 07:12)

So let us take the case where


● X is a dice throw.
● Y is a toss of a coin and
● Z is drawing a card from a deck.

In this case you can see that obviously drawing the card from the deck has no bearing on X or Y: taken pairwise, Y is independent of Z, and X is also independent of Z, Ok. You will also see that p(x, y) = p(x) p(y) automatically, because x and y are independent as well, as we saw earlier.

The throw of a die has nothing to do with the toss of a coin, so the events X and Y are of course independent. They are also conditionally independent, in the sense that

p(x, y | z) = p(x | z) p(y | z)

that is, x and y jointly given z is the same as x given z times y given z. Why? Because since x and z are independent, p(x | z) = p(x), and similarly p(y | z) = p(y).
Similarly, p(x, y | z) = p(x, y), and the rest follows from the fact that x and y are already independent.

Ok so this is an example where x and y are not only independent but they are also
conditionally independent.

(Refer Slide Time: 08:33)

Let us take a different example, where
● X denotes a person's height,
● Y denotes the vocabulary of the person whose height we are measuring, Ok, and
● Z denotes age.

In this case let us first ask the question ignoring Z, whatever definition I gave for Z here, and look at whether height is independent of vocabulary.

Now, a priori it looks like it should not matter what a person's height is; vocabulary should be independent of height. However, if I tell you this person is just 2 feet tall, it is most probable that this person is a child and therefore has a small vocabulary. So X and Y by themselves, unless I give some further conditions, are not independent.

(Refer Slide Time: 09:28)

So please remember this: in this case X and Y are not really independent variables. They become independent only if I impose a particular condition.

So let us give such a condition. One such condition would be age. Suppose I look at all people of age 12: would the person's height matter for their vocabulary? The answer, at least as common sense and common observation say, is no.

Similarly, if I fix the age at 30, people of age 30, regardless of height, will have vocabularies that are whatever they are; at least they do not depend on height. And if I fix the age at 2, it will not matter what the baby's height is; height and vocabulary would be independent. It could be a slightly taller child with a smaller vocabulary, or a shorter child with a larger vocabulary, etc. Ok.

So this is an example of a case where the two variables are not independent but they are
conditionally independent, Ok. So if you give a particular condition they actually become
independent.

So let us look at a third case where the two variables were originally independent but they
actually, after you apply condition, they are no longer independent.

So here is a simple example. I have 2 dice. I throw one and denote its value as X, that is the event X. The second throw has value Y, Ok, and these two events, as I know, are independent.

If I have two different independent dice and throw them, the value I get on one has no bearing on the value I get on the other, Ok.

But suppose I fix the sum of the dice, Ok, and that is the variable Z. Then if I look at (x, y) given z: the moment I give you the value of x and the value of z, the value of y is fixed. Therefore, these events are no longer conditionally independent, Ok.

So here is a case where the events are independent, but after you add a condition they are no
longer independent, Ok.
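A tiny Python sketch of this last example (my own illustration) makes the point concrete:

from itertools import product

# Two independent fair dice X and Y; Z = X + Y is their sum.
outcomes = list(product(range(1, 7), repeat=2))   # 36 equally likely pairs

# Unconditionally: p(x, y) = 1/36 = (1/6)(1/6), so X and Y are independent.

# Now condition on Z = 7; six pairs remain: (1,6), (2,5), ..., (6,1).
pairs_z7 = [(x, y) for x, y in outcomes if x + y == 7]

p_xy_given_z = 1 / len(pairs_z7)                                     # p(x=1, y=6 | z=7) = 1/6
p_x_given_z = sum(1 for x, y in pairs_z7 if x == 1) / len(pairs_z7)  # p(x=1 | z=7) = 1/6
p_y_given_z = sum(1 for x, y in pairs_z7 if y == 6) / len(pairs_z7)  # p(y=6 | z=7) = 1/6

# 1/6 != (1/6)(1/6): X and Y are NOT conditionally independent given Z.
print(p_xy_given_z, p_x_given_z * p_y_given_z)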

So conditional independence, as you can see, is a separate idea from that of independence. You can have all sorts of combinations: independent but not conditionally independent, and so on, as you just saw on the slide.

The notation that we use for conditional independence is x ⊥ y | z, which means x is independent of y given z, Ok. This follows naturally from what we saw on the previous slide.

(Refer Slide Time: 11:53)

So let us look at the chain rule of conditional probability.

(Refer Slide Time: 11:57)

So remember the product rule that we had for the joint probability: p(x, y) = p(x | y) p(y), Ok. Now
(Refer Slide Time: 12:12)

we might try to extend this to three variables, the joint probability of three variables. So let us say we write p(x, y, z) = p(x, a), where a is the event (y, z). A good way to find the expression for this is to reuse the product rule.
(Refer Slide Time: 12:31)

(Refer Slide Time: 12:34)

That means a is the event of y and z occurring together, the joint event (y, z), Ok.

So if we do that,

(Refer Slide Time: 12:42)

so we can now write

⇒ p(x, y, z) = p(x, a) = p(x | a) p(a)
= p(x | a) p(y, z)
= p(x | a) p(y | z) p(z)
= p(x | y, z) p(y | z) p(z)

that is, p(x, y, z) = p(z) p(y | z) p(x | y, z)

(Refer Slide Time: 13:32)

we can write this as p of z multiplied by p of y given z multiplied by p of x given y and z.


You can see that this is actually a natural interpretation of the

(Refer Slide Time: 13:44)

probability of all 3 events, x and y and z happening. So what does this say?

This says: the probability that x, y, z all occur is the probability that z occurs, multiplied by the probability of y given that z occurs, multiplied by the probability that x happens given that y and z have both occurred, Ok.

So this is a simple interpretation of what happens, and this is the chain rule for 3 variables.
(Refer Slide Time: 14:24)

Of course, now we can extend it to n variables in general:

p(x^(1), x^(2), ⋯, x^(n)) = p(x^(1)) p(x^(2) | x^(1)) ⋯ p(x^(n) | x^(1), ⋯, x^(n−1))

So instead of just 3, if you have n events, x^(1) through x^(n), which occur jointly, you can write the joint probability as the probability that the first event occurs, multiplied by the probability that the second occurs given the first, and so on, up to the probability that the nth event occurs given that the first n − 1 have occurred.

In compact notation, you can write this as the probability that the first event occurs multiplied by a product (remember Π denotes a product):

p(x^(1), ⋯, x^(n)) = p(x^(1)) Π_{i=2}^{n} p(x^(i) | x^(1), ⋯, x^(i−1))

that is, the product over i = 2 to n of the probability of x^(i) given that x^(1) through x^(i−1) occur,

(Refer Slide Time: 15:15)

Ok.

This is the chain rule of probability,

(Refer Slide Time: 15:18)

Ok.
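To see the chain rule working, here is a short numerical check in Python (my own sketch; the distribution is random and only for illustration):

from itertools import product
import random

# A random joint distribution over three binary variables (x, y, z).
random.seed(0)
raw = {e: random.random() for e in product([0, 1], repeat=3)}
total = sum(raw.values())
p = {e: v / total for e, v in raw.items()}   # normalized joint p(x, y, z)

def p_z(z):
    return sum(v for (x, y, zz), v in p.items() if zz == z)

def p_yz(y, z):
    return sum(v for (x, yy, zz), v in p.items() if yy == y and zz == z)

# Chain rule: p(x, y, z) = p(z) * p(y | z) * p(x | y, z)
x, y, z = 1, 0, 1
lhs = p[(x, y, z)]
rhs = p_z(z) * (p_yz(y, z) / p_z(z)) * (p[(x, y, z)] / p_yz(y, z))
print(abs(lhs - rhs) < 1e-12)  # True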

(Refer Slide Time: 15:20)

So here is a simple context where we will be using conditional probabilities later; this is just to give you a look ahead at what will be happening.

Let us say you have some such image. This is an image of something called an odd-eyed cat; it does occur naturally, even though it looks like a fake image: a cat with two differently coloured eyes. Suppose we ask, what is the probability that such and such an image occurs?

Now remember our idea from the linear algebra videos: let us say this is a 60 × 60 image, so it has 3600 pixels, the first pixel, the 60th pixel, and so on, up to the 3600th pixel. What I want is basically the probability that all these pixels take the intensities that they have taken. So we

(Refer Slide Time: 16:21)

can think of this image now as a single event: the first pixel taking the value x_1, the second pixel taking the value x_2, and so on and so forth, till the 3600th pixel takes the value x_3600, Ok.
And how do we find this out?

(Refer Slide Time: 16:40)

You can basically think of this as a joint probability, and now, using the chain rule we just wrote, you can write down what the probability would be:

p(x_1, ⋯, x_3600) = p(x_1) p(x_2 | x_1) ⋯ p(x_3600 | x_1, ⋯, x_3599)

(Refer Slide Time: 17:07)

Ok.

Now, in addition to this, if we can somehow use independence, or conditional independence, which we will also do later using certain machine learning models, you will see that this whole expression can be simplified tremendously.

So this is one context, not the only one, where we will be using conditional probabilities and the chain rule very conveniently.

Machine Learning for Engineering and Science Applications
Professor Doctor Balaji Srinivasan
Department of Mechanical Engineering
Indian Institute of Technology Madras
Expectation

(Refer Slide Time: 00:13)

In this video we will be looking at a very simple statistical quantity, or statistical function, called the expectation. You are familiar with the expectation as

(Refer Slide Time: 00:23)

the mean or the average, Ok.

(Refer Slide Time: 00:25)

So the context is this. If you remember, we are dealing with random variables throughout.
Random variables by definition will result in different outcomes.

If I throw a dice right now, sometimes it will give a 1, sometimes it will give a 2, sometimes
it will give a 6. So this is obviously why it is called a random variable in the first place,

(Refer Slide Time: 00:43)

Ok.

Which way the random variable actually varies, or how it gives different outcomes, is captured by what is known as the distribution. If you recall, if you have a discrete distribution, such as a die, a coin, or a deck of cards, then we have something called the probability mass function, which tells you how likely each one of these outcomes is.

(Refer Slide Time: 01:08)

Similarly, for a continuous variable such as height, weight, temperature, pressure, stress, strain, etc., what you have is the probability density function, which tells you how likely a range of values is. So the probability that height lies between 5.6 and 5.7 is some number, Ok. That is what a probability density function gives you.

(Refer Slide Time: 01:32)

Now, once you are given the distribution, we start using some overall ideas. A random variable can take a large number of values, but you want to give some summary statistics, Ok, some qualitative and quantitative picture of what the random variable is doing.

The most common ones that we will be using, at least as far as this course is concerned, are two quantities called the expectation and the variance, Ok. The expectation is something you are already familiar with; we usually call it the mean or the average, but in the context of random variables we typically call it the expectation, Ok.

(Refer Slide Time: 02:10)

Now the expectation gives the mean, average, or expected value of the random variable once you know the distribution, Ok; that is important. You need to know the distribution of the random variable, and then you can find out the expectation.

So here are a couple of examples.

● You can say: I have invested in the stock market; what are my expected returns? Obviously the returns are not fixed, they form a random variable, but nonetheless, given a certain investment, what is my expected return from the stock market?

(Refer Slide Time: 02:42)

Another example:

● I know that the monsoon is going to hit. What is the expected rainfall during the coming monsoon? Ok. You could ask questions of that sort. Again this is a random variable, but overall a farmer might still be interested in knowing the expected crop yield.
(Refer Slide Time: 03:03)

So here is how we define expectation, Ok: it is the average value.

(Refer Slide Time: 03:11)

Right now I am talking not only about the expectation of a random variable x, but about that of a function of x. The expectation, or expected value (both terminologies are used), of some function f(x) of a random variable x is the average value of f(x) when x is drawn from a probability distribution P.

Recall that "x is drawn from P" is written as x ∼ P.

(Refer Slide Time: 03:44)

So the notation that we use is

E_{x∼P}[f(x)]

read as "the expectation, over x drawn from P, of f(x)". This is the notation that we will be using; it is the most detailed or rigorous notation,

(Refer Slide Time: 04:18)

Ok.

But more often than not,

(Refer Slide Time: 04:20)

we use some shortcuts. If you know which probability distribution we are talking about, we simply write E_x[f(x)].

(Refer Slide Time: 04:29)

Even more simply, if you also know which random variable x we are considering, you can simply write E[f(x)], and sometimes just E[f], Ok. All these notations are used for the expectation.

(Refer Slide Time: 04:46)

So mathematically how do we calculate expectation?

(Refer Slide Time: 04:49)

For a discrete variable it is

Discrete: E_{x∼P}[f(x)] = Σ_x P(x) f(x), P(x): PMF

Ok, and for a continuous variable it is

Continuous: E_{x∼P}[f(x)] = ∫_x P(x) f(x) dx, P(x): PDF

(Refer Slide Time: 05:00)

Remember, in the discrete case P(x) is a probability mass function, while in the continuous case it is a probability density function, which is why you have to multiply by dx in order to get a probability: you integrate P(x) f(x) dx over all possible values of x. Now we will see a couple of

(Refer Slide Time: 05:26)

very, very trivial and simple examples on

(Refer Slide Time: 05:29)

the next slide.

(Refer Slide Time: 05:32)

So now, a generalization of the expectation we just saw for a single variable: usually, especially within machine learning, we are dealing with vectors, Ok. This is called the multivariate expectation, where x is now a vector consisting of components (x_1, x_2, ⋯, x_n).

(Refer Slide Time: 06:00)

Ok.

As we saw in the previous video, this could be an image, or temperature, pressure, velocity, etc. It could be any number of variables, Ok. So if you have that, then you can

(Refer Slide Time: 06:15)

consider each component separately.

(Refer Slide Time: 06:18)

And all you do is take the expectation of variable 1, the expectation of variable 2, and so on. The reason I wrote it out here is that, notice, the first component is the expectation over variable 1, Ok.

So if, let us say, my x vector is (temperature, pressure, humidity),

(Refer Slide Time: 06:47)

then I take the expectation over all possible values of temperature, of whatever function of temperature interests you, and so on and so forth: the expectation over all possible values of pressure of a function of pressure, Ok.

(Refer Slide Time: 07:06)

So that is multivariate expectation. Multivariate simply means multiple variables, Ok. It is not
a single scalar, it is a vector.

200
(Refer Slide Time: 07:15)

So here are some trivial examples. I am going to do univariate examples here

(Refer Slide Time: 07:19)

Ok.

So all of us know, let us say you want to know the expected value of a toss of a coin, for a
fair coin assuming that heads has a value 1 and tails has a value 0,

201
(Refer Slide Time: 07:30)

then I will just do it in detail so that you get used to this kind of calculation if you are not
already used to it.

So you first identify the random variable you are considering. Here the
random variable is the result of the toss of the coin and it is X ∈ {0, 1},

(Refer Slide Time: 07:45)

Ok. What is P? You now need to know what the distribution P is. So:

● First identify the random variable.

(Refer Slide Time: 07:59)

● Next calculate the probability distribution. In this case it is a mass distribution
because it is a discrete random variable.

Now the probability distribution is very simple. If I have x and P(x), then x takes the value 0 with
probability half, and x takes the value 1

203
(Refer Slide Time: 08:25)

also with the probability half.

(Refer Slide Time: 08:30)

So if you find out the expectation, it is simply

E_{x∼P}[x] = ∑_x x P(x) = 0×(1/2) + 1×(1/2) = 1/2.

204
(Refer Slide Time: 08:38)

All of us know this. Another way to look at it is that the average value of the toss that you will
obtain is basically going to be half. So notice that even though we call the expectation the
average value or the expected value, obviously no single toss of the coin will actually give you
half: neither heads nor tails is half. It just represents an average, a weighted average, of the
values that come out, Ok.

(Refer Slide Time: 09:05)

Similarly, if you want to find out the expected value of a fair dice

205
(Refer Slide Time: 09:09)

throw this is going to be the average of 1, 2, 3, 4, 5, 6.

(Refer Slide Time: 09:14)

Given that all of these are equally probable, assuming this is a fair dice and it is not like sort
of a loaded dice or something, again from the same idea you get

E_{x∼P}[x] = ∑_x x P(x) = 1×(1/6) + 2×(1/6) + ⋯ + 6×(1/6) = 3.5.
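To make this concrete, here is a minimal Python/NumPy sketch (not from the lecture itself; the function name and values are our own choices) that evaluates E[x] = ∑_x x P(x) for the coin and the die:

import numpy as np

def expectation(values, probs):
    # E[x] = sum over x of x * P(x), for a discrete PMF
    values = np.asarray(values, dtype=float)
    probs = np.asarray(probs, dtype=float)
    assert np.isclose(probs.sum(), 1.0), "PMF must sum to 1"
    return float(np.sum(values * probs))

# Fair coin: heads = 1, tails = 0
print(expectation([0, 1], [0.5, 0.5]))      # 0.5
# Fair die: faces 1..6, each with probability 1/6
print(expectation(range(1, 7), [1/6] * 6))  # 3.5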

206
(Refer Slide Time: 09:23)

Let us look at a slightly more complex

(Refer Slide Time: 09:26)

example. This is just to see, you know may be a slight increase over simple averages. Ok so
what is the expected value of the sum of two dice thrown together?
Now the random variable here is x ∈ {2, 3, …, 12};

207
(Refer Slide Time: 09:40)

obviously 1 cannot occur if you are throwing 2 dice and taking the sum.
So you have the variable that goes between 2 and 12.

(Refer Slide Time: 09:51)

For the probability distribution you have to be a little careful now, Ok. Now notice that the
probability of 2 is not the same as the probability of 3,

208
(Refer Slide Time: 10:03)

Ok, unlike the previous case where we had uniform probability distributions; in this case each
of these probabilities is different, Ok.
So here is the distribution.

(Refer Slide Time: 10:14)

Both 2 and 12, that is x = 2 as well as x = 12, can occur in only one way: for 2 you need
(1, 1) and for 12 you need (6, 6), and each of these occurs with probability (1/6)×(1/6),
which is 1/36.

209
Similarly, 3 can occur
(Refer Slide Time: 10:34)

in two ways, namely (1, 2) and (2, 1), Ok, so therefore you get 2/36; similarly for 11. For 4 you
have three ways, for 5 you have four ways, for 6 you have five ways, you know (1, 5); (2, 4);
(3, 3); (5, 1) and (4, 2) put together. 7 can actually occur in six different ways, and so these
are the probabilities.

So notice that unlike the previous examples that we took, in this case P(x) is an actual
non-uniform distribution.

(Refer Slide Time: 11:06)

So if you find out expectation,

210

E_{x∼P}[x] = ∑_x x P(x) = 2×(1/36) + 3×(2/36) + ⋯ + 7×(6/36) + ⋯ + 12×(1/36) = 7

But the calculation is a little bit lengthy, Ok. So the question is,

(Refer Slide Time: 11:26)

is there an easier way of calculating this case?

So for this we use a

(Refer Slide Time: 11:31)

simple idea called the linearity of expectation,


(Refer Slide Time: 11:35)

211
Ok. This is an extremely important property of expectations.

(Refer Slide Time: 11:39)

The idea is that


(Refer Slide Time: 11:42)

212
● The expectation operator, so this thing, this is a linear operator.

(Refer Slide Time: 11:49)

What do I mean by linear?


● Mathematically, if f(x) = α g(x) + β h(x), with α, β ∈ R being scalars, then
E[f] = α E[g] + β E[h]

That is if you have f which is a linear combination, please remember linear combination from
our discussion, if f is a linear combination

213
(Refer Slide Time: 12:04)

of two other functions g and h, alpha and beta let us say are scalars.
(Refer Slide Time: 12:13)

then expectation of f can be written as alpha times expectation of g plus beta times
expectation of h, Ok.

So this is an important property. We will just prove it in the next slide,

214
(Refer Slide Time: 12:29)

Ok. Also notice that I have used a compact notation here: instead of writing E_{x∼P}[f(x)]
etc., I simply wrote E[f] = α E[g] + β E[h].

(Refer Slide Time: 12:44)

So you can apply this. I will prove this shortly but before that let us simply apply this to our
two dice case. Remember that in order to find out the expectation of the two dice we actually
had to find out first the probability distribution of each of those occurrences, different
outcomes and then we had to do the expectation calculation, Ok.

But suppose we notice that the two dice are essentially two different random variables
coming together, one is D₁ and one is D₂, where D₁ is the value that you got out of the first
die, and D₂ is the value that you got out of the second die, Ok.

215
(Refer Slide Time: 13:21)

So through our linearity we can write


E[x] = E[D₁] + E[D₂] = 3.5 + 3.5 = 7
(Refer Slide Time: 13:37)

we know this already because the expected value from one dice is 3.5,

216
(Refer Slide Time: 13:44)

the expected value from the other dice is also 3.5 so the expectations add up and you can see
that this is a remarkably simple calculation,

(Refer Slide Time: 13:52)

Ok.
This is a much, much simpler calculation compared to actually finding out the overall, you
know probability of, probability distribution of x.

So that is the advantage of using linearity of expectations

217
(Refer Slide Time: 14:07)

and this is a very commonly used property. Let me give a very quick proof.
Linearity: if f(x) = α g(x) + β h(x), with α, β ∈ R being scalars, then

E[f] = α E[g] + β E[h]   (claim)

E[f] = ∫ f(x) p(x) dx   (definition)

     = ∫ (α g(x) + β h(x)) p(x) dx

     = α ∫ g(x) p(x) dx + β ∫ h(x) p(x) dx

⇒ E[f] = α E[g] + β E[h]
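As a quick numerical check of this linearity (a minimal Python sketch, not from the lecture), one can enumerate all 36 outcomes of two fair dice and compare the direct expectation of the sum with the sum of the individual expectations:

import numpy as np
from itertools import product

# All 36 equally likely outcomes of throwing two fair dice
outcomes = list(product(range(1, 7), repeat=2))
sums = np.array([d1 + d2 for d1, d2 in outcomes], dtype=float)

# Direct expectation of the sum: average over the 36 outcomes
print(sums.mean())                  # 7.0
# Linearity: E[D1 + D2] = E[D1] + E[D2] = 3.5 + 3.5
print(np.mean(range(1, 7)) * 2)     # 7.0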

(Refer Slide Time: 14:14)

218
(Refer Slide Time: 14:26)

(Refer Slide Time: 14:34)

219
(Refer Slide Time: 14:38)

(Refer Slide Time: 14:40)

220
(Refer Slide Time: 14:53)

(Refer Slide Time: 14:56)

(Refer Slide Time: 15:20)

221
(Refer Slide Time: 15:25)

222
(Refer Slide Time: 15:29)

(Refer Slide Time: 15:47)

You can prove the linearity property of expectation for a discrete random variable similarly. I
suggest that you try this as an exercise. Thank you.

223
Machine Learning for Engineering and Science Applications
Professor Dr. Balaji Srinivasan
Department of Mechanical Engineering
Indian Institute of Technology Madras
Variance Covariance

(Refer Slide Time: 00:13)

In the previous video we had looked at the idea of Expectation. In this video we will be
looking at some additional ideas which are called variance and covariance. Again you will be
familiar with some of these

(Refer Slide Time: 00:28)

from school and some of these might be unfamiliar.

So let us continue

224
(Refer Slide Time: 00:33)

our discussion from where we had our expectation discussions. Once again the same

(Refer Slide Time: 00:37)

idea continues from before. If you have a random variable it is going to vary, Ok. And its
variation is actually captured by its distribution. What expectation does is, on an average
what can you expect. Ok this is what it will tell you.

225
(Refer Slide Time: 00:53)

But what we want to know is variance. Variance is, Ok, it is given that this is the expected
value. For example expectation for dice throw is 3.5 but how much more can it go, how much
less will it go? What is the variation from the expected value? What is the variation from the
mean, from the expectation? This is what variance talks about,

(Refer Slide Time: 01:14)

Ok.

Variance also measures, in many cases, how much the quantity fluctuates, Ok. So it is
entirely possible for two distributions to have exactly the same expectation but different
variances.

226
For example if I take the value 13, 13, 13 or, Ok

(Refer Slide Time: 01:33)

so let us say there is some random value which takes the value 13, 13, 13. In this case the
expectation is 13. Another one

(Refer Slide Time: 01:42)

is 12, 13, 14. In these cases you have the same expectation but

227
(Refer Slide Time: 01:53)

different variance, Ok.

(Refer Slide Time: 01:58)

We see this one varies

228
(Refer Slide Time: 01:59)

more from the mean and this one in fact does not vary

(Refer Slide Time: 02:03)

from the mean at all and we actually get variance is equal to 0. So

229
(Refer Slide Time: 02:08)

once again a practical example: if we go back to the investment example, the expected
return is one thing, but you would also like to know how much it can change, Ok.

So your expected returns might be x y z amount of Rupees, but your actual variation, you
could actually have a profit or you could actually have a loss, even though your expected
value might be a profit, Ok.

So sometimes, and in fact a lot of times, in what is called modern portfolio theory, the
variance is what is used to quantify the risk of an investment.

(Refer Slide Time: 02:45)

230
Though this is a little bit controversial, many people do use variance as a measure of
risk.

So similarly I gave the example of the expected value of rainfall during the coming monsoon.
But you could also look at the variance in rainfall during coming monsoon.

Not only do you want to know what the rainfall is but you also want to know what is the
maximum it can go, you know in case you have to plan for a flood, Ok. Or what is the least
that it can go in case you have to plan for drought, Ok. So variance measures these ideas.

So let us look

(Refer Slide Time: 03:21)

at mathematically what variance is. Once again we will start with a univariate case.
Remember univariate simply means scalar. So if I take a single value x or a single random
variable x which is a scalar, you look at two sorts of quantities, variance and standard
deviation.

Standard deviation is usually denoted by σ

231
(Refer Slide Time: 03:43)

and it is usually equal to, not usually, it is defined as being equal to square root of variance.

(Refer Slide Time: 03:49)

So what variance measures is how much does the value of f vary, Ok, vary from its expected
value, Ok when x is drawn from P? Once again the notation,

232
(Refer Slide Time: 04:05)

if you want to be very precise you will say

V_{x∼P}[f(x)].
(Refer Slide Time: 04:14)

Ok.

Once again if P is clear

233
(Refer Slide Time: 04:16)

from the context we drop P and simply say V_x[f(x)]. If x is also clear we say V[f(x)].

(Refer Slide Time: 04:22)

and usually, and this is unlike the expectation, the usual notation which is used is either
V[f] or Var[f]; Var stands for variance.

234
(Refer Slide Time: 04:36)

(Refer Slide Time: 04:39)

Mathematically,

V[f(x)] = E[(f(x) − E[f(x)])²]

The variance of f(x) is built from the difference between f(x) and the expected value of
f(x),

235
(Refer Slide Time: 04:49)

Ok. So this, I will, we use the notation

(Refer Slide Time: 04:56)

just for simplification: E[x] = x̄ and E[f] = f̄.

236
(Refer Slide Time: 05:06)

So what is the variance of f? It is E[(f − f̄)²],

(Refer Slide Time: 05:17)

that is, the mean of the squared deviation from the mean. So this is a

237
(Refer Slide Time: 05:26)

mean square value. So variance is essentially a mean square.

We can also define the

(Refer Slide Time: 05:35)

standard deviation. Standard deviation σ = √(V[f]). So standard deviation is root mean square,


Ok. Variance is mean square;

238
(Refer Slide Time: 05:55)

standard deviation is root mean square.
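A minimal Python/NumPy sketch (illustrative only, not from the lecture) of the 13, 13, 13 versus 12, 13, 14 example, showing equal means but different variances and standard deviations:

import numpy as np

x = np.array([13.0, 13.0, 13.0])
y = np.array([12.0, 13.0, 14.0])

print(x.mean(), y.mean())         # 13.0 13.0 : same expectation
print(x.var(), y.var())           # 0.0  0.666... : different variances
print(np.sqrt(y.var()), y.std())  # standard deviation = sqrt(variance)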

(Refer Slide Time: 05:58)

Now there is another idea, some of you might actually be unfamiliar with this idea. This is
the idea of covariance. Now covariance is something we are interested in, if instead of having
one variable x, you now have two variables, Ok. When I say

239
(Refer Slide Time: 06:17)

univariate, what it means is that each of x and y is individually univariate; we look at a
multivariate case a little bit later, Ok.

So notice that when we defined the variance of x, this was

V[x] = E[(x − x̄)²]

(Refer Slide Time: 06:48)

So this idea can now be generalized to a pair of variables, Ok. This I can call self covariance.
The meaning will become
(Refer Slide Time: 07:00)

240
shortly clear as we go through the next couple of slides, Ok.
So suppose instead of one variable, you have two variables, we simply generalize this idea

(Refer Slide Time: 07:08)

and we define the covariance of two variables x and y as

241
Cov[x, y] = E[(x − x̄)(y − ȳ)] = E[(x − E[x])(y − E[y])]

(Refer Slide Time: 07:17)

What does that mean? This says that how much does x vary from its mean when y varies
from its own mean, Ok. So this is what is called covariance. That is how much do x and y
vary together, Ok.

Now, just like we defined the covariance of x and y (again I use x̄ just for compact notation),
you can also define the covariance of f(x) and g(y):
Cov[f(x), g(y)] = E[(f(x) − E[f(x)])(g(y) − E[g(y)])]

(Refer Slide Time: 07:51)

All you do is replace x with f(x) and y with g(y),
(Refer Slide Time: 07:53)

242
(Refer Slide Time: 07:56)

243
(Refer Slide Time: 07:58)

and you will get the corresponding covariance function for f(x) and g(y).

(Refer Slide Time: 08:05)

244
(Refer Slide Time: 08:06)

Now a closely related quantity to covariance is something called the correlation:

corr[x, y] = Cov[x, y] / (σ_x σ_y)

where σ_x and σ_y are the standard deviations of x and y.

(Refer Slide Time: 08:14)

245
(Refer Slide Time: 08:35)

So essentially correlation is normalized covariance, Ok.

(Refer Slide Time: 08:47)

What does normalized mean? You know, just like we normalize a vector and make it into a
unit vector, similarly you are normalizing the covariance so that its size stays between
certain limits. So as it turns out, correlation will always lie in [−1, 1]

246
(Refer Slide Time: 09:10)

Now we had looked at the

(Refer Slide Time: 09:15)

ideas of variance and covariance.

247
(Refer Slide Time: 09:20)

What does correlation indicate? Correlation indicates how linearly correlated, the word
linearly will become clear in the next slide, how linearly correlated the two variables, the two
random variables are.
For example let us say x is height and y is weight, Ok. What you expect is, as height
increases weight also increases. Of course there is randomness here. There are tall people
who can be very, very, very slim and there can be short people who could actually have
higher weight than the taller person but nonetheless you can see that overall the trend will be
that as x increases y also increases, Ok.

Notice that by definition,

Cov[x, x] = Var[x] and corr[x, x] = 1

248
(Refer Slide Time: 10:34)

(Refer Slide Time: 10:36)

So let us now in this slide look at an interpretation of what covariance and correlation mean. I
have already given you a slight

249
(Refer Slide Time: 10:47)

indication. So let us say this direction here is x and here it is y.

(Refer Slide Time: 10:55)

And let us say x and y are random variables. And you see this kind of variation amongst
them, Ok.
You can see that even though both are random, as x increases, y also increases. That is at
least the overall trend, Ok. Now if I calculate the covariance in this case it comes to some
27.23.

250
(Refer Slide Time: 11:19)

Now the general rule of, not rule of thumb, the general rule is that if you have positive
covariance then it means that as x increases y is also expected to increase, even though in a
certain few cases it can happen that, you know at this given x, you have a y and at a slightly
higher x, you have lower y.

(Refer Slide Time: 11:36)

251
(Refer Slide Time: 11:39)

Even though that expectation could be violated, overall the trend is that as x increases, y
increases. And that is what is indicated by covariance, the value of covariance, Ok or the sign
of covariance.

(Refer Slide Time: 11:50)

252
(Refer Slide Time: 11:54)

Now you can see a counter case. In this case what you notice is as x increases, y decreases.
And sure enough the covariance in this case is negative, Ok. It is negative 28.09. And what
negative covariance indicates is that as x increases, y actually decreases.

(Refer Slide Time: 12:01)

253
(Refer Slide Time: 12:07)

(Refer Slide Time: 12:11)

Let us look at the third case. You will see that you really cannot say anything about any trend.
It seems like x and y have no relation whatsoever. If you find out the covariance in this case,
it still gives a positive value: 8.53. But we would like to qualitatively
distinguish between these two cases, Ok.

Here covariance is positive but you see a clear trend. Here covariance is positive even though
it is a smaller value, we can notice that it is a smaller value but it is all over the place, Ok. So
can we relate these two? It turns out we can by looking at the correlation.

254
(Refer Slide Time: 12:41)

(Refer Slide Time: 12:59)

So notice that the correlation in this case is 0.97. This is what normalization means: the
covariance is simply the numerator, and it tells you what the trend is. But how positive is
positive? We will not know unless we compare it with something, and the comparison metric
here is the variance of these two quantities.
(Refer Slide Time: 13:04)

255
So if you find out the correlation, that is, normalize the covariance by dividing by σ_x σ_y, then you
get a correlation of 0.97, which is very, very close to 1. Remember that if I had simply
taken a linear relationship, x versus x, I would have got 1.

(Refer Slide Time: 13:39)

So a correlation which is close to 1 means that there is a strong positive linear correlation
between the two variables.

256
(Refer Slide Time: 13:52)

Similarly, if I find out the correlation in this case, I find that it is −0.98, and it is strongly
negatively linearly correlated. Now what we would expect is that in a case of the third sort our
correlation should be very low, Ok, even though the covariance is positive. And indeed this is the
case: the correlation in this case is 0.14, which is much smaller in comparison to
0.97.

(Refer Slide Time: 14:02)

257
(Refer Slide Time: 14:20)

In fact, as a typical rule of thumb, if the correlation is below 0.3 or even 0.5, you might as well
assume that the variables are not correlated. Of course, the lower the correlation gets, the
weaker the relationship between the two variables,

(Refer Slide Time: 14:37)

Ok.

So correlation which is close to 0 means there is no real correlation between the two
variables.
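To see these ideas numerically, here is a minimal Python/NumPy sketch (not from the lecture; the synthetic data is our own) that builds a noisy increasing relationship and computes the covariance and the normalized correlation:

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, 1000)
y = 2.0 * x + rng.normal(0.0, 1.0, 1000)   # y tends to increase with x

# Cov[x, y] = E[xy] - E[x] E[y]
cov_xy = np.mean(x * y) - x.mean() * y.mean()
# corr[x, y] = Cov[x, y] / (sigma_x * sigma_y), always in [-1, 1]
corr_xy = cov_xy / (x.std() * y.std())

print(cov_xy)    # positive: x and y move together
print(corr_xy)   # close to +1: strong positive linear trend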

258
(Refer Slide Time: 14:45)

So let us look at some mathematical simplification of covariance which we will be using


multiple times through this course.

(Refer Slide Time: 14:51)

So recall that the covariance of x, y is

Cov[x, y] = E[(x − x̄)(y − ȳ)] = E[(x − E[x])(y − E[y])]

You can simplify this, and the simplification is usually useful for calculations:

Cov[x, y] = E[xy] − x̄ ȳ = E[xy] − E[x] E[y]

259
(Refer Slide Time: 15:03)

(Refer Slide Time: 15:21)

260
(Refer Slide Time: 15:30)

So let me quickly prove this. We will use the fact that the expectation of an expectation, or the
average of an average, is simply the same as the average, because the average is one single
value, Ok. At least it is useful to think of it that way: if I have already taken the average of 10
values I have got one single number, and if I take the average of that, it does not make any
difference whatsoever. Ok, so

E[E[x]] = E[x], i.e. the average of x̄ is x̄ itself.
So remember that

Cov[x, y] = E[(x − x̄)(y − ȳ)] = E[xy − x̄y − xȳ + x̄ȳ]
         = E[xy] − E[x̄y] − E[xȳ] + E[x̄ȳ]
         = E[xy] − x̄ E[y] − E[x] ȳ + E[x] E[y]
         = E[xy] − E[x] E[y]

(Refer Slide Time: 16:12)

261
(Refer Slide Time: 16:27)

(Refer Slide Time: 16:52)

(Refer Slide Time: 17:24)

262
(Refer Slide Time: 17:48)

(Refer Slide Time: 17:55)

263
(Refer Slide Time: 18:00)

(Refer Slide Time: 18:12)

(Refer Slide Time: 18:20)

264
There is a special case of this which we tend to use, which is

Cov[x, x] = Var[x] = E[x²] − (E[x])²

265
(Refer Slide Time: 18:38)

(Refer Slide Time: 18:48)

266
(Refer Slide Time: 18:57)

So this is also an equivalence that we will often use.
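A quick numerical sanity check of this identity (a Python sketch, not from the lecture):

import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(5.0, 2.0, 100000)

# Var[x] = E[x^2] - (E[x])^2
print(np.mean(x**2) - np.mean(x)**2)   # close to 4 (= 2^2)
print(np.var(x))                       # the same value, up to rounding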

(Refer Slide Time: 19:06)

Let us come now to a certain point about covariance and independence that I wish to
discuss.

267
(Refer Slide Time: 19:13)

So we already saw that when x and y are two independent random variables, then the covariance
of x and y is actually 0. So if x varies in one way and y varies in a completely independent
way, you will actually get Cov[x, y] = 0.
You can actually prove it from the previous slide's results also, but I will not do that; I will just
appeal to your intuition, Ok.

(Refer Slide Time: 19:33)

(Refer Slide Time: 19:37)

268
So if you have, you know, a sort of random spread of points of this kind, then Cov[x, y] = 0.
However, if you want to go the other way, Ok, which is to say that if Cov[x, y] = 0 then x
and y are independent,

(Refer Slide Time: 19:52)

this is not true, Ok. So Cov[x, y] = 0 need not necessarily mean that x and y are independent
of each other.

269
(Refer Slide Time: 20:02)

Let us see an example, ok. So let us say x is a random variable with E[x] = 0 and also E[x³] = 0,
Ok. This need not always be the case, but you can easily generate a set of
random numbers for which it is true, Ok. For example, if I take x to be a
uniform random variable with values going from −10 to 10, uniformly spaced, where all
values are equally possible, then E[x] = 0 and also E[x³] = 0, as you can quite easily see, Ok.

(Refer Slide Time: 20:38)

Now let us take y = x², Ok. Now y is another random variable.


(Refer Slide Time: 20:45)

270
So if x is a random variable and y = x², you know you will find the distribution of x and y like
this, Ok. Now clearly x and y are not independent, Ok. Obviously the value of y depends on
the value of x.

(Refer Slide Time: 21:04)

However, if we try and find out the covariance,

Cov[x, y] = E[xy] − E[x] E[y] = E[x³] − E[x] E[x²] = 0

271
(Refer Slide Time: 21:09)

By construction E[x³] = 0 and E[x] = 0, so the covariance actually is 0.

(Refer Slide Time: 21:27)

272
(Refer Slide Time: 21:30)

So for this case, even though you can see a nice relationship between the two variables, the
covariance is actually 0, Ok. That is, the covariance is 0 even though the variables are not
independent of each other, Ok. So what it turns out is that zero covariance only means that there is no
linear relationship, Ok.
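Here is a minimal Python/NumPy sketch (not from the lecture) of exactly this construction, a uniform x on [−10, 10] with y = x², showing that the covariance comes out numerically close to 0 even though y is completely determined by x:

import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-10.0, 10.0, 1_000_000)  # E[x] = 0 and E[x^3] = 0 by symmetry
y = x**2                                 # y is fully determined by x

# Cov[x, y] = E[xy] - E[x]E[y] = E[x^3] - E[x]E[x^2] -> 0
print(np.mean(x * y) - np.mean(x) * np.mean(y))  # close to 0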

(Refer Slide Time: 21:42)

273
(Refer Slide Time: 21:49)

(Refer Slide Time: 22:01)

So this can get a little bit confusing. So let me summarize this. In case two variables are
independent, Ok in this direction, so if independence is there, then there is zero covariance.
But it does not necessarily mean that if there is zero covariance that there is independence.

274
(Refer Slide Time: 22:22)

If there is zero covariance, all you can say is that they are not linearly dependent on each
other, Ok but you cannot talk about general independence, Ok. So covariance basically
measures linear dependence of one quantity on the other.

(Refer Slide Time: 22:38)

The final idea that I would like to discuss in this video is that of a covariance matrix. So, so
far we have been looking at univariate x.
(Refer Slide Time: 22:51)

275
(Refer Slide Time: 22:56)

Now let us consider a case such as an image, where x ∈ Rⁿ is a vector, Ok. It need not be an
image; of course I will keep using that example because it is the most intuitive one and we
will be using it very often in this course, but it could be anything, Ok. So not only would you
like to know what the probability of each pixel is, but you would also like to know the joint
probability of, let us say, one pixel being white and another pixel being black, which is
usually how you can characterize images. Or if you look at an input vector such as
temperature, pressure, humidity: what is the probability that the temperature is so much and
the pressure is something else, the joint probability, stuff like that. And how much does
temperature vary from its mean given that humidity has varied from its mean? That is usually
how the covariance matrix is used, Ok.

276
(Refer Slide Time: 23:50)

If x ∈ Rⁿ, then Cov[x, x] is the matrix of pairwise covariances Cov[x_i, x_j].
So we define a pairwise covariance of each pair of the variables: temperature with pressure,
temperature with humidity, pressure with humidity, etc., and also temperature with
temperature, which is simply a variance, etc.
And that is how you define a covariance matrix.

(Refer Slide Time: 24:17)

277
The covariance matrix, even though it looks big, is actually very simple to define.
So you have x ∈ Rⁿ, x = (x₁, x₂, …, xₙ).

(Refer Slide Time: 24:34)

You find pairwise covariances. So the first entry of this matrix is Cov[x₁, x₁], which is
obviously the same as the variance of x₁.

278
(Refer Slide Time: 24:45)

The second entry here is Cov[x₁, x₂], so on and so forth; in general the (i, j) entry is
Cov[x_i, x_j], Ok. This matrix will have size n × n, and the diagonal entries of the matrix are
simply the variances of the individual components.

(Refer Slide Time: 25:00)

279
(Refer Slide Time: 25:02)

(Refer Slide Time: 25:10)

So, Ok, this one will be the variance of x₂, and this one the variance of xₙ. So the covariance
matrix is something that we will be using quite often. Sometimes it is denoted by Σ.
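As an illustration (a minimal Python/NumPy sketch, not from the lecture; np.cov is NumPy's built-in covariance-matrix routine, and the three synthetic variables are our own), the covariance matrix of a small multivariate dataset:

import numpy as np

rng = np.random.default_rng(3)
# Three variables, e.g. temperature-, pressure- and humidity-like signals
t = rng.normal(30.0, 5.0, 10000)
p = 0.5 * t + rng.normal(0.0, 1.0, 10000)  # correlated with t
h = rng.normal(50.0, 10.0, 10000)          # independent of t and p

X = np.stack([t, p, h])          # one row per variable
Sigma = np.cov(X)                # the 3 x 3 covariance matrix

print(Sigma.shape)               # (3, 3)
print(np.diag(Sigma))            # diagonal entries: the individual variances
print(Sigma[0, 1], Sigma[1, 0])  # Cov[t, p] = Cov[p, t]: symmetric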
(Refer Slide Time: 25:22)

280
(Refer Slide Time: 25:31)

(Refer Slide Time: 25:36)

281
So we will be seeing this in greater detail in future videos. Thank you.

282
Machine Learning for Engineering and Science Applications
Professor Dr. Balaji Srinivasan
Department of Mechanical Engineering
Indian Institute of Technology Madras
Some Relations for Expectation and Covariance (Slightly Advanced)

(Refer Slide Time: 00:14)

In this video we will be looking at some more relations for expectation and covariance. These
are slightly advanced relations. We will be using these only rarely in the course. So in case
you do not understand this portion that is Ok, you will get to understand it a little better as the
course progresses. So please do not panic in case it looks a little bit unfamiliar to you.

(Refer Slide Time: 00:33)

283
So we had looked at the variance for a single vector, variance of x vector. Now we are going
to look at the covariance of two vectors.
Cov[f(x), g(y)] = E[(f(x) − E[f(x)])(g(y) − E[g(y)])]

(Refer Slide Time: 00:46)

(Refer Slide Time: 00:50)

So remember that for scalar functions f(x) and g(y) we had defined the covariance as the
deviation of f(x) from its expectation, multiplied by the deviation of g(y) from its
expectation, with the expectation of the whole thing taken, Ok.

284
(Refer Slide Time: 01:08)

You might also remember the simpler form

Cov[x, y] = E[xy] − x̄ ȳ = E[xy] − E[x] E[y]

Ok. Remember also that Cov[x, x] = Var[x], which is the square of the standard
deviation, Ok.

(Refer Slide Time: 01:16)

285
(Refer Slide Time: 01:33)

(Refer Slide Time: 01:52)

Now we had defined something called the covariance matrix,

Cov[x, x]: the matrix whose (i, j) entry is Cov[x_i, x_j]

This was a matrix of all covariances; you can recollect this from the previous video. It is
simply a matrix of covariances: the first element will be Cov[x₁, x₁], so on and so
forth. Remember x is a vector and x_i is the i-th component of the vector x.

286
(Refer Slide Time: 02:09)

Now let us say we denote by μ (this is standard notation: the mean or expectation is denoted by
mu) the expectation of the vector x. Since x is a vector, μ is also going to
be a vector; again, if you look at the previous videos you will see that the first element μ₁
is the expectation of the first component, and so on, so it is a full vector. So if μ is
the expectation of the random vector x, then

Cov[x, x] = E[(x − μ)(x − μ)ᵀ]

(Refer Slide Time: 02:37)

287
(Refer Slide Time: 02:44)

(Refer Slide Time: 03:09)

288
Why are we taking a transpose? What is the size of this?
(x − μ) is an m × 1 vector; if I take a transpose, (x − μ)ᵀ is 1 × m. Therefore what you will get
is m × m.

289
(Refer Slide Time: 03:21)

(Refer Slide Time: 03:25)

290
(Refer Slide Time: 03:33)

Remember this: if x is a scalar this simply comes to Cov[x, x] = Var[x] = E[x²] − (E[x])².

(Refer Slide Time: 04:00)

This we have seen before also. As I said before we used the notation that
Var[x] = Cov[x, x]

291
(Refer Slide Time: 04:03)

(Refer Slide Time: 04:07)

Now, very similar to this idea, we can define Cov[x, y]. Let us say x ∈ Rᵐ and y ∈ Rⁿ, which
means x is an m × 1 vector and y is an n × 1 vector, so yᵀ is a 1 × n vector, Ok.

292
(Refer Slide Time: 04:19)

(Refer Slide Time: 04:23)

293
(Refer Slide Time: 04:34)

Cov[x, y] = E[x yᵀ] − E[x] E[y]ᵀ

You can show easily that Cov[x, y] = Cov[y, x]ᵀ. I would suggest that you try this as an exercise.

(Refer Slide Time: 05:06)

294
(Refer Slide Time: 05:14)

Now another idea that we are going to look at is sums of two random variables.
Remember that we had already discussed that expectation is a linear operator, so that

E[α f + β g] = α E[f] + β E[g]

Now we are going to extend this idea to two random variables x, y ∈ Rⁿ; then

E[x + y] = E[x] + E[y]
E[α x + β y] = α E[x] + β E[y]

Variances are a bit more involved:

Var[x + y] = Var[x] + Var[y] + Cov[x, y] + Cov[y, x]
Var[α x] = α² Var[x]

(Refer Slide Time: 05:18)

295
(Refer Slide Time: 05:27)

(Refer Slide Time: 05:41)

(Refer Slide Time: 05:48)

296
(Refer Slide Time: 05:55)

(Refer Slide Time: 06:01)

297
(Refer Slide Time: 06:11)

(Refer Slide Time: 06:22)

(Refer Slide Time: 06:30)

298
(Refer Slide Time: 06:33)

Note that if x, y are independent, then

Var[x + y] = Var[x] + Var[y]

since when x and y are independent,

Cov[x, y] = 0 and Cov[y, x] = 0

(Refer Slide Time: 06:53)

299
(Refer Slide Time: 07:00)

So only under the condition that x and y are independent do we get that the variance of
x + y is the variance of x plus the variance of y, Ok.

(Refer Slide Time: 07:15)

300
Now as an exercise please think about what happens to Var[x − y]:

Var[x − y] = Var[x] + Var[y] − Cov[x, y] − Cov[y, x]

Again I would suggest this as a quick exercise for you to try out, Ok. If you ignore the minus
parts, the −Cov[x, y] terms, notice that even if you take the difference of two variables,
the variances still add, Ok. Variance is like the error, Ok; it is like the variation.

(Refer Slide Time: 07:55)

You would have seen this in simple experimental measurements, perhaps a little bit before:
if you have two lengths and I take one length minus another length, it does not mean the
errors subtract; the errors still add, because the errors go as the variance, Ok.
So therefore the errors add here.
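A quick numerical check of these sum and difference formulas (a Python sketch, not from the lecture):

import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(0.0, 1.0, 100000)
y = 0.7 * x + rng.normal(0.0, 1.0, 100000)   # x and y are correlated

cov = np.cov(x, y, bias=True)[0, 1]
# Var[x + y] = Var[x] + Var[y] + 2 Cov[x, y]
print(np.var(x + y), np.var(x) + np.var(y) + 2 * cov)
# Var[x - y] = Var[x] + Var[y] - 2 Cov[x, y]: the variances still add
print(np.var(x - y), np.var(x) + np.var(y) - 2 * cov)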

301
(Refer Slide Time: 08:26)

(Refer Slide Time: 08:29)

The final idea that I would like to discuss in this video is that of an affine transformation. An
affine transformation is essentially a linear transformation together with a constant shift.
So let us say you have two random variables or two variables x and y. x is a vector, y is a
vector. A is a matrix.

(Refer Slide Time: 08:48)

302
You can transform, y can be seen as x with a linear transform, b is also a vector here of
course in order for our dimensions to match, Ok.
So x is a vector, b is a vector, A is a matrix and y is another vector, with y = Ax + b. Now if I
have this, suppose I know the mean of x, or suppose I know the variance of x, can I find out
the mean and variance of y? It turns out that there are mathematical relationships.
Expectation works out as we expect, Ok.
(Refer Slide Time: 09:08)

(Refer Slide Time: 09:17)

303
(Refer Slide Time: 09:23)

So, we will write the expressions here.


E_y[y] = E_x[Ax + b] = A E_x[x] + b = Aμ + b
Please notice the subscripts: the first expectation is over the variable y, the second over the
variable x. So E_y[y] = E_x[Ax + b], which, since expectation is linear, lets us simply take
A, a constant matrix, outside: A E_x[x] + b. μ of course is the notation for the
expectation E_x[x]. So this is quite simple as far as the expectation is concerned.

(Refer Slide Time: 09:55)

304
(Refer Slide Time: 10:01)

Now the variance, which we will go through slowly, is once again slightly more involved:

V_y = V_x[Ax + b]
    = V_x[Ax]
    = Cov[Ax, Ax]
    = E[(Ax)(Ax)ᵀ] − E[Ax] E[Ax]ᵀ
    = E[A x xᵀ Aᵀ] − A E[x] E[x]ᵀ Aᵀ
    = A (E[x xᵀ] − E[x] E[x]ᵀ) Aᵀ
    = A Cov[x, x] Aᵀ

Ok. Let us look at it step by step.

305
(Refer Slide Time: 10:15)

The variance of Ax + b is the same as the variance of Ax, because the variance of a quantity
plus a constant is the same as the variance of the quantity.

(Refer Slide Time: 10:34)

Why is that? You can prove this very easily mathematically but let us just look at it very
simply from a physical point of view.

Variance measures the difference from the mean, Ok or the distance from the mean. If I
change the variable by a constant the mean will also go up by a constant and that constant
subtracts out.

306
We are only looking at the difference from the mean and we are not looking at the actual
value of the variable, Ok. Since that is the case so whether I add b or not, I am going to get
the same variance, Ok. Now the variance of Ax was defined as Cov[Ax, Ax].

(Refer Slide Time: 11:32)

(Refer Slide Time: 11:49)

(Refer Slide Time: 12:13)

307
Cov[y, y] = E[y yᵀ] − E[y] E[y]ᵀ

That is what is used here. Now I can bring out A from the front end and bring out Aᵀ from the
back end, Ok. So we can do that: A from the front end, Aᵀ from the back end, and taking those
out common we get A (E[x xᵀ] − E[x] E[x]ᵀ) Aᵀ.
(Refer Slide Time: 12:45)

So you can now rewrite it: the middle factor is nothing but Cov[x, x], which we discussed in
the previous slide, so V_y = A Cov[x, x] Aᵀ. Typically the covariance matrix is denoted by Σ:
just like μ is the expectation of x, Σ is the covariance of x, which is the same as Cov[x, x].
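A minimal Python/NumPy sketch (not from the lecture; the particular A, b, μ and Σ are our own choices) checking both relations, E[y] = Aμ + b and V_y = AΣAᵀ, on random samples:

import numpy as np

rng = np.random.default_rng(5)
mu = np.array([1.0, 2.0])
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
x = rng.multivariate_normal(mu, Sigma, size=200000)  # samples as rows

A = np.array([[1.0, 2.0], [0.0, 3.0]])
b = np.array([4.0, -1.0])
y = x @ A.T + b                 # y = A x + b, applied row by row

print(y.mean(axis=0))           # close to A @ mu + b
print(A @ mu + b)
print(np.cov(y, rowvar=False))  # close to A @ Sigma @ A.T
print(A @ Sigma @ A.T)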

(Refer Slide Time: 13:00)

308
(Refer Slide Time: 13:06)

309
(Refer Slide Time: 13:18)

So we have looked at a lot of mathematical relations in this video. As I said earlier you might
or might not be comfortable with it. Even if you are not comfortable with it, we do not use at
least this set of expressions too often. But please get comfortable by watching this a few
times in case you found it unfamiliar. We will use it a little bit towards the latter half of the
course, thank you.

310
Machine learning for engineering and science applications
Professor Dr. Balaji Srinivasan
Department of Mechanical Engineering
Indian Institute of Technology Madras
Machine Representation of Numbers, Overflow, Underflow, Condition Number

(Refer Slide Time: 0:14)

In this video we will be looking at how your machine, your computer, represents numbers,
okay, and a few phenomena which can go wrong when we assume that the way the computer
processes numbers is the way we do things on paper, intuitively, okay. One such idea is
something called overflow and underflow; we will also look at another idea called condition
number. Most of the examples (not the slides, but the examples) are taken from a good
introductory book on numerical methods by Steven Chapra.
311
(Refer Slide Time: 0:52)

So let us look at the idea of machine arithmetic. In the previous slides, when we were doing
optimisation, we were doing it theoretically: if you want to find the minimum of a
function, you simply say that the gradient at the minimum, at the optimum, is 0. Now,
in order to do this on paper, or using symbols, you will usually assume
real number arithmetic, okay; you will assume that you can calculate digits to as much
precision as you want. You will also assume, for example, that if I
differentiate x² with respect to x, I am going to get 2x. But in practice, remember, we
do not deal with symbols; we in fact do not even deal with images, as I have said multiple
times so far. We actually deal only with numbers, okay, and specifically numbers of finite
precision, as you will see; this will start making sense as you go a little bit further.

This kind of arithmetic is called finite precision arithmetic or machine arithmetic, ok. In some
cases you can call it "floating point arithmetic", which is the most common case as far as we are
concerned: floating point numbers are the numbers where we deal with real numbers rather
than with integers. Now, the fact that we have only finite precision can actually have
surprisingly important and sometimes surprisingly catastrophic consequences, okay. So let us
take one such recent example, ok.

312
(Refer Slide Time: 2:29)

So the example is that of Ariane 5. This is a European launch vehicle, and this was the very first
test in the Ariane 5 configuration (there were of course Ariane 1 to 4 before it), on June 4th 1996,
okay. The launch seemed normal for the first 37 seconds, okay. After that, dramatically... I
would recommend that you take a look at the video on YouTube or something: if you simply
search for Ariane 5 launch, you will get this video.

So if you take a look at what happens: at approximately 37 seconds after launch the rocket
suddenly turned by 90 degrees, incorrectly; this was not planned, of course. The boosters were
ripped apart, and the vehicle, which basically had self-destruction instructions sitting there,
self-destructed automatically. So it was a giant loss: estimates vary between 350 million and
500 million US dollars, and it is perhaps one of the most expensive problems ever caused by a
software failure, ok; a simple software failure caused this. And what really happened, if you
dig into it, was not ignorance really; people just did not adequately take care of the fact that
we are doing finite precision arithmetic, ok, rather than real arithmetic, in some sense, okay.
You will see how that happened a little bit later.

313
(Refer Slide Time: 4:08)

So let us look at this: a machine has a finite number of bits, okay. Unlike, you know, how a
human being writes, where if you require more precision you simply keep adding digits, a
machine has a finite, predetermined number of digits, ok. You can think of these
digits, whenever you store a number, as individual boxes, and in each box either a 0 or a 1
will be stored. As you know, for the most part all of our arithmetic is
done in binary, basically a 0 or 1 system; every single thing is actually represented in terms
of zeros and ones. That is both the power and, as you can sometimes see, a possible source
of problems.

So let us take a simple integer, okay. If you have an integer like 173, which is what we
would call it in base 10, you can now write it in binary; you would have all done this in
school. You get the representation 10101101, because 173 = 2⁰ + 2² + 2³ + 2⁵ + 2⁷ (there is
no 2¹, 2⁴ or 2⁶ contribution), ok.
So as far as the machine is concerned, it is going to look like this: you are going
to have about 8 digits here, ok, some 1s and some 0s, and each box can either
store a 0 or a 1, okay.

Now suppose instead of 173 you have something like −173; now what are you going to do?
What we do in terms of representation in a machine is to use something called a sign bit, ok.
The sign bit will be another box up front here: if it is 1, the machine will interpret the number
as negative; if it is 0, it will assume that the integer is
314
positive ok. Now I had 8 boxes here but let us say I have a 16-bit machine or a 16-bit
representation, 16-bit simply means I have 16 boxes now ok.

So you will have something of this sort; remember the very first one, the leftmost bit, is what is
called the sign bit, if it is 1 it basically means it is negative, the rest of it essentially represents
the magnitude, in this case I have just copied this from there to here. So this number will be
interpreted as -173, the - comes from here, 173 comes because all these are 0 and this is 173
ok. Now this has an implication, implication is that there is a maximum number that you can
represent on the machine okay. That is because if you run out of digits or run out of boxes to
store your number, you can no longer represent a large number, this is somewhat similar to
calculators okay.

So if you have a calculator with 8 digits, you cannot store a number which is greater than 8
digits of course we will account for exponents a little bit later even in this video but the main
point that I am going to make here and if you get nothing else in this particular video, please
take away this one single point that there is a maximum number that the machine can store
accurately and there is also a minimum number that the machine can store accurately ok. So
in this case, if you see the maximum and minimum for 16 bits, okay: remember one of the bits has
been used for the sign, so you have only 15 bits left, so you can represent up to 2¹⁵ − 1, which
comes to about ±32,000, ok. Similarly, if I increase from 16 bits to 32 bits, I will get
2³¹ − 1, etc.
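For concreteness, here is a minimal Python sketch (not from the lecture) showing the binary representation of 173 and the signed-integer limits just mentioned; np.iinfo is NumPy's built-in query for integer limits:

import numpy as np

print(bin(173))   # 0b10101101 = 2^7 + 2^5 + 2^3 + 2^2 + 2^0

# Limits of fixed-width signed integers:
print(np.iinfo(np.int16).min, np.iinfo(np.int16).max)  # -32768 32767 (2^15 - 1)
print(np.iinfo(np.int32).max)                          # 2147483647 (2^31 - 1)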

(Refer Slide Time: 8:35)

315
Now, this was about integers; you have a similar representation for floating point. Once again
you are free to skim these details, as long as you understand in particular that there is a
minimum and a maximum floating point number that you can use on the computer, ok. So
real numbers can also be represented in binary. Let us say you have the number 5.5: it can be
written as 4 + 1 + 0.5, and 0.5 of course is 2⁻¹, so you will write it as 101.1, because after the
point come the negative powers, which is similar to how we deal with decimals: after the
decimal point, if we have 0.3, this is 3 × 10⁻¹ in the base 10 representation, and one digit
after that would be 10⁻², and similarly here too.

Now we have a more compact notation which we usually call scientific notation, as in
calculators: instead of simply writing a number in terms of decimal places, you write it as
± S × bᴱ, a significand times the base raised to an exponent, okay. S is what is called the
significand, which contains all the significant digits of the number; b is the base we are using
(10 if we are using base 10, 2 if it is binary); and E is the exponent. So for example, if you
have the number 0.001234, you would write it as 1.234 × 10⁻³, where 1.234 is S, 10 is the
base and the exponent is −3, okay.

Now, as I said last time, since there is a maximum and minimum number on the computer,
you would like to stretch this range as much as possible. For a binary number the first
significant digit will always be 1, okay, because if it is 0 we ignore it; we only look at
significant digits starting from 1. So we can take that leading 1 for granted, and instead of S
we write the number as (1 + F) × 2ᴱ, and this gives you a little bit of extra room to store
numbers, okay. So for example, the same 5.5, written as 101.1 in binary, will be
1.011 × 2⁺² (not 2⁻²), okay. So the significand is 1 + F with F = 0.011, the base is 2 and the
exponent is +2.

F is called the mantissa and E is of course the exponent. Now note that F and E need separate
bins: you have to store this 0.011, and you also have to store the exponent 2, in separate
sets of boxes, ok. So for a 64-bit storage scheme, which is fairly standard for what is called
double precision, we store the digits this way: you keep 1 bit for the sign, 11 bits for the
signed exponent, that is this E, ok, and 52 bits for the mantissa, which is F, okay. Now this is
what is called an IEEE

316
standard, there is a standardised way of storing this, you know you can make other choices
but this is the standard that people have agreed to on how to take 64 bits and store floating
point numbers okay.

(Refer Slide Time: 12:36)

So let us look at double precision. Double precision is the standard precision used for real number
data: if you use Matlab this is the default; in other cases, let us say C, C++, etc., you have
two options, something called float and something called double precision.
Once again, for most scientific computations we tend to use double precision, to be as accurate
as possible; you will find this within GPUs also, single precision versus double precision, ok.
So remember that for 64 bits we had already seen the 1, 11, 52 split: this was for the sign bit,
this for the signed exponent (remember the exponent by itself can have a sign), okay, and this
for the mantissa, okay, which was F.

Once again, since there are only limited boxes for storing the exponent, there is once again a
maximum as well as a minimum positive number that can be represented ok, remember we
have now 11 bits for the signed exponent so we have to remove one bit for the sign and you
will get 2 power 10 is 1024, so you will have from 1023 to - 1022 that is the range within
which you can represent the exponents, remember we are only talking about exponents in this
particular video.

So the largest number that you can represent: let us say I have 52 digits here, so I take
1.1111… (in binary) and I can go up to about 2¹⁰²⁴; that is the maximum you can represent
within double precision 64-bit, okay. If you go above this using double precision, the
computer will either give NaN, which is called "not a number", or it will
give Inf, which is infinity, depending on what the compiler is like.

Similarly, you have a smallest positive number: just above 0, what is the smallest number you
can get? It is 1.000… × 2⁻¹⁰²², which is approximately 2.2 × 10⁻³⁰⁸, okay. So this seems like
a very wide range, but sometimes you can actually go beyond it very easily.

So I have flashed on the screen a simple example from Matlab. Matlab has a variable called
realmax that tells you what the maximum number is; you can see this number here, ok,
approximately 1.8 × 10³⁰⁸. This is the maximum number that Matlab can represent.
Similarly, there is a minimum number, realmin, which is approximately 2.2 × 10⁻³⁰⁸; this is
from Matlab.

(Refer Slide Time: 15:51)

So just like range, you have a slightly different idea called precision. Let us first start with an
example, ok. Let us say I have √2, and once again I am writing this in Matlab; you
will see some set of digits, okay, a standard set of numbers thrown up here. Now let
me add a certain amount of error, a small number, to it, okay; this number here is 10⁻¹⁴.
If you compare the original with the case with the error, you will see all digits are the
same except for this one digit, which was 0 here and became 1 there, because I added
10⁻¹⁴, ok.

318
So now let us say that instead of 10⁻¹⁴ I add 10⁻¹⁶; what is it that we would expect? Since
adding 10⁻¹⁴ changed the 14th decimal place, adding 10⁻¹⁶ should change the 16th digit: this
5 should actually turn into a 6, right? That is what we would expect, so let us see what happens.
Suppose I add 10⁻¹⁶ to √2: surprisingly enough, the 16th digit stays the same, okay. Why did
this happen? We can now look at another example. Let us say A is 1, B is −1, and the error,
once again I will call it error, is 10⁻¹⁶. Suppose I do A + B + error: it gives me the right
thing, because A + B is 0, and 0 + error is 10⁻¹⁶.

But suppose I change the order: instead of (A + B) + error I write (A + error) + B. We know
that addition has associativity and commutativity, all those properties, so commuting
B + error to error + B should give you the same result, but it gives you 0, ok. What seems to
be happening is that instead of adding the error it is actually adding 0. Now why does this
happen? The reason is that in both these cases the mantissa, not just the exponent, matters:
even though the exponent allows you to go till 10⁻³⁰⁸, the mantissa is also limited.

The mantissa is limited to 52 bits, okay. The mantissa, remember, is 1 point something, with a
fixed number of boxes to store it, times 2ᴱ; E we saw in the previous slide, and now we are
looking at the mantissa portion, okay, what happens there. So the precision of double precision
is given by 2⁻⁵², okay; that is the smallest relative increment you can represent, which turns
out to be approximately 2.2 × 10⁻¹⁶. So any number below this will simply disappear in an
addition. Just to give you an example: suppose you have a very bad calculator that can
represent only 3 digits on the screen, so 0.00. Now suppose I give you the number 0.001:
there is no space for it to be stored.

To give you another example, suppose I compute 1.00 + 0.001. I take 1.00, which has used up
my 3 digits, and if I add 0.001, this is out of range, so it will simply give me 1.00; the extra
digit cannot come in at all, ok. Now how does that affect our earlier case? Notice that when I
do (A + B) + error, A + B is already 0 (this is the order in which the machine will do the
additions), and 0 + error has enough space, ok, since there are 16 digits available to see
10⁻¹⁶. However, when I do (A + error) + B, something similar to the calculator example
happens: I have 1.000… with 16 digits, plus 0.000…1 with no place to add it, so the machine
sees A + error as simply 1, and B is −1, which is why it gives 0, ok.

319
So if you go to Matlab once again: the smallest relative number, the restriction given by the
mantissa, is called machine epsilon, okay. If you simply type eps in Matlab you will get
this value, approximately 2.22 × 10⁻¹⁶, okay. This is the smallest relative difference that the
machine can resolve in floating point additions and subtractions, okay.
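In Python/NumPy the same three quantities are available through np.finfo, and the two addition experiments above can be reproduced directly (a minimal sketch, not from the lecture):

import numpy as np

print(np.finfo(np.float64).max)    # ~1.8e308, the analogue of Matlab's realmax
print(np.finfo(np.float64).tiny)   # ~2.2e-308, the analogue of Matlab's realmin
print(np.finfo(np.float64).eps)    # ~2.22e-16, the analogue of Matlab's eps

# A perturbation below the relative precision simply disappears:
print(np.sqrt(2) + 1e-16 == np.sqrt(2))   # True: the 1e-16 is lost
# The order of additions matters in finite precision:
print((1.0 + -1.0) + 1e-16)   # 1e-16
print((1.0 + 1e-16) + -1.0)   # 0.0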

(Refer Slide Time: 21:02)

Now, because of the combination of these two, the range and the precision, the main takeaway
once again is that there are smallest and biggest numbers that the machine can accurately add
and subtract, okay. Now you have 2 types of errors. An underflow error is what happens in
case numbers near 0 are rounded off to 0, ok; this is the same kind of example that we saw
last time: you had 10⁻¹⁶, and when you add it to √2 nothing really happens, so effectively
10⁻¹⁶ is being rounded off to 0, okay; you are not getting any addition here.

So here is a simple figure to represent this. Let us say this is the max positive number that
you can represent and this is the min positive number, and this is the negative limit; when I
call it negative max, what I obviously mean is maximum in terms of absolute value. Whenever
you are caught between these two small limits, okay, let us say between −10⁻¹⁶ and 10⁻¹⁶,
it is called underflow error. In some sense this is going below the least count of the machine,
okay: just as a scale has a least count, say 1 mm, below which you cannot measure accurately,
similarly below 10⁻¹⁶ for double precision you will have trouble, okay. So if you have
numbers going below that, and you do not account for them in terms of the exponent
separately, you are going to have trouble.

320
Another thing that can happen in terms of underflow is that you might have a divide-by-0
error, okay: even though your denominator is not really 0, if it goes below your machine
epsilon, or, you know, even below your minimum of 10⁻³⁰⁸, you can actually get a divide-by-0
error; it can occur in many different ways, ok. Overflow happens when you actually go above
the maximum limit, okay. So let us see an example. Let us say you have a simple
expression (we will see that this is a special case of something called softmax when we move
into the neural network portion): the function e^(x₁) / (e^(x₁) + e^(x₂)), okay.

Now let us say x₁ = x₂; in such a case it simply gives you half, okay, since
e^(x₁) / (e^(x₁) + e^(x₂)) = 1/2. So let us try this in Matlab. Let us say I take a vector with
x₁ = 5000 and x₂ = 5000, these are just 2 numbers, and I try this expression
e^(x₁) / (e^(x₁) + e^(x₂)); I get not a number. Now why is that? Because e^(5000) has
exceeded your maximum representable number.

So even though the calculation is absurdly simple, you can do it by hand; this is what I meant
at the start of this video, that there are certain things you can do by hand very easily, but the
machine, being dumb and doing sequential operations, will simply compute e^(5000) first and
say, well, I cannot store it, so this is not a number. It turns out that there are ways of tricking
the machine into doing the right thing, okay, so I will just show one example here. Instead of
doing your calculations in terms of x, we subtract out the maximum: z = x − max(x), or
zᵢ = xᵢ − maxᵢ xᵢ, ok. If you subtract that out, it turns out that this function does not change,
because you are simply multiplying by e^(−max(x)) in both the numerator and the
denominator.

Now if I write it that way and compute e^(z₁) / (e^(z₁) + e^(z₂)), I get back the right
result, ok. So the point is, if you simply wrote the original expression in your code, in your
program, if you are lucky nothing would happen; if you are unlucky you might get not a
number, and you might be confused about where that not a number came from. So the fact
that the machine has a maximum and minimum can cause surprising errors, okay: you might
not have a formula problem, you might not have a compilation problem, but you could have
an overflow or underflow problem because you have not accounted for the way numbers are
represented. In fact OpenAI, one of the companies that works on AI, is now trying to exploit
the fact that there is finite precision in order to come up with some machine learning
algorithms; that is well beyond the scope of this course, but I just wanted to point it out, ok.
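Here is a minimal Python/NumPy sketch (not from the lecture) of the same trick, usually called the stable softmax; softmax_naive and softmax_stable are our own names:

import numpy as np

def softmax_naive(x):
    e = np.exp(x)        # overflows to inf for large x
    return e / e.sum()

def softmax_stable(x):
    z = x - np.max(x)    # subtract the max; the ratio is unchanged
    e = np.exp(z)
    return e / e.sum()

x = np.array([5000.0, 5000.0])
print(softmax_naive(x))    # [nan nan], with an overflow warning
print(softmax_stable(x))   # [0.5 0.5]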

322
(Refer Slide Time: 26:40)

Now it turns out that the Ariane 5 disaster with which I started this video was also due to an
overflow problem, okay. Remember that all was fine in the beginning. The internal enquiry
board's report is actually available online; I have put a reference to it somewhere later here,
ok. Since it is a French system I am going to call it SRI; it is actually the inertial reference
system, which tells you, very roughly, which way the rocket is pointing. It had a variable,
just like the variable in the previous slide, called BH, which is actually used to determine the
orientation of the rocket, whether it is pointing up or pointing down, etc., and this orientation
was represented by a floating point variable, a real number, okay.

Now, this variable was stored as a 64-bit floating point number, but due to several internal reasons, part of the reason being that the previous vehicle, Ariane 4, used some 16-bit integers, the software had to convert this BH from a 64-bit floating point number into a 16-bit signed integer. This was not a problem for the previous version of the launch vehicle, Ariane 4, because all the numbers for the orientation, speed, etc. were well within the limits of a 16-bit integer.

However, after about 37 seconds it basically reached overflow: some number within the calculation went beyond the 16-bit limit. The 16-bit unsigned maximum is about 65,000 and the signed maximum is about 32,000, and in either case this number was exceeded due to the much greater acceleration of Ariane 5 in comparison to Ariane 4. So some numbers were exceeded, and you can see that because of that the system essentially got confused: instead of going straight up, the orientation was misread, the rocket turned, and then the self-destruct mechanism took over. In the words of the report, the internal inertial reference system software exception was caused by the conversion from a 64-bit floating point to a 16-bit signed integer value; as I said, the number had a value greater than what could be represented, so this is classic overflow.

So the overflow caused the Ariane 5 disaster. This is to tell you that though the examples I gave might seem extreme, finite precision can have very real-life effects; similar problems have happened in other cases, for instance during the Gulf War. So the fact that the machine represents numbers in finite precision sometimes has to be accounted for. If you are lucky it will almost never matter, but if you see a completely unexplainable phenomenon, where everything seems to work on paper and yet fails in practice, it could sometimes be an underflow or overflow error, thank you.

(Refer Slide Time: 30:21)

So the last topic in this video is that of the condition number. The fact that you have limited precision, which is what we have been looking at in the previous slides, can have many unexpected results, some of which you have already seen. Let us take a simple case: we are simply summing up the number 0.0001, adding it 10,000 times. What would you expect? 0.0001 multiplied by 10,000 should be one. If you actually execute this program you will find that it is not quite 1; notice that there is some error in the last couple of places.
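
A hedged sketch of this experiment (the lecture runs it in Matlab; this NumPy version is an assumed equivalent):

    import numpy as np

    # 0.0001 has no exact binary representation, so the rounding error
    # in each of the 10,000 additions accumulates.
    s = np.float64(0.0)
    for _ in range(10000):
        s += np.float64(0.0001)

    print(s)         # very close to, but not exactly, 1.0
    print(s == 1.0)  # False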

Now why does this happen? This is because precision errors propagate. What do I mean by that? Remember we are adding 0.0001, which is 10 power -4; this does not have an exact representation in binary, it has a repeating representation in binary, so the last digit gets chopped off, and that has an actual effect. As you add this 10 power -4, with its error in the 16th digit, many many times, the error starts from the end and leaks upwards towards the left. This additive effect of finite precision can be particularly bad if you have multiple calculations, so I will show you one example.

Let us say you are solving a system of linear equations; let me give you an example. This is the matrix A: 1, 2, 2 and 4.0000...1, so there is a 1 sitting in the last decimal place of A. Let us say my x is 1 and -1, a fairly simple example, and suppose I define b = A times x. This of course means that x = A inverse b, so if I compute A inverse b, I should recover x. I do not quite recover x: you can see that instead of 1 the error has propagated up by about 5 digits, and similarly, instead of getting -1 I got some error propagation. However, something even more serious can happen, because typically you solve x as A inverse b; so let us say I introduce an error in b.

Now, instead of b being this, suppose I added 0.01, or subtracted 0.01: you can see that I have made a small change in b, so b goes to b + delta b, or b1 is b + delta b. The question is, if I change b to b + delta b while A remains the same, what is the change in x? Remember, since I am doing finite precision arithmetic, you saw earlier that some numbers might not be exactly represented. So if I make a small change in a number, if instead of storing 1 I store 1.00001, how much of a change will it make while solving linear systems of equations? When I compute x1 = A inverse b1, what I would expect is only a small change in x, since I have made only a small change in b, but you can see this is a huge change.

From being (1, -1), the solution has actually turned into numbers of order 10 power 8, with signs flipped. So just a small change of 0.01 in b has caused a change of order 10 power 8 in x, which is quite worrying, and this is why we look at the nature of the matrix A. Just as with division by numbers: suppose I have y = a divided by x; if x is very, very small, then small changes in a can cause large changes in y. Similarly, if A is close to singular, a small change in b can be greatly magnified by A.
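
A hedged reconstruction of this demo in NumPy; the slide's matrix entry is shown only partially in the transcript, so the value 4.0000000001 below is an assumption chosen to be consistent with the condition number of about 10 power 11 quoted next:

    import numpy as np

    A = np.array([[1.0, 2.0],
                  [2.0, 4.0000000001]])  # nearly singular (assumed entry)
    x = np.array([1.0, -1.0])
    b = A @ x

    # Recovering x from b already shows small errors in the last digits.
    print(np.linalg.solve(A, b))

    # A tiny perturbation of b produces an enormous change in the solution.
    b1 = b + np.array([0.01, 0.0])
    print(np.linalg.solve(A, b1))  # entries of order 1e8, nothing like (1, -1)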

This is measured by something called the condition number. The condition number is defined as norm of A (remember, you can go back to our norm videos; the norm of A is some measure of the size of A) multiplied by norm of A inverse. For symmetric matrices there is an easy way of measuring this condition number, which is to take the ratio of eigenvalues: find the maximum eigenvalue in magnitude and divide by the minimum eigenvalue in magnitude, and that tells you roughly by how much errors in your answer can be magnified.

For example, we made a change in the second decimal place, and it was magnified to order 10 power 8, which is an increase of 10 decimal places; so we are basically magnifying an error by a factor of 10 power 10. If you look at the condition number of this matrix, just to clarify this, you will see it is about 10 power 11, and it tells you, very roughly (this is not very precise), that errors in your answer can be magnified by a factor of up to 10 power 11; that is the worst-case scenario, and we are getting close to the worst-case scenario here.

In general, if you have a high condition number, this means you have a poorly conditioned matrix, and certain software, for example Matlab, will warn you that you have an ill-conditioned matrix. Ill-conditioned means the condition number is really high, and any small errors may be magnified very greatly.
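
The condition number itself can be checked directly; this hedged sketch reuses the assumed matrix from above:

    import numpy as np

    A = np.array([[1.0, 2.0],
                  [2.0, 4.0000000001]])

    print(np.linalg.cond(A))  # roughly 1e11

    # For a symmetric matrix this is (roughly) the ratio of the largest
    # to the smallest eigenvalue in magnitude.
    eig = np.linalg.eigvalsh(A)
    print(np.abs(eig).max() / np.abs(eig).min())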

So in this video, just as a recapitulation, we looked at a few implications of the fact that your numbers are not represented exactly as you might think: there is a finite amount of precision, that finite precision has a minimum limit and a maximum limit, and sometimes these imprecisions can compound and lead to really large, poor effects, thank you.

Machine Learning for Engineering and Science Applications
Professor Dr. Balaji Srinivasan
Department of Mechanical Engineering
Indian Institute of Technology, Madras
Derivatives, Gradient, Hessian, Jacobian, Taylor Series

This week we will be dealing with optimisation, and as you would know from your experience in school as well as in college, almost all optimisation requires you to find derivatives. So in this video we will be looking at derivatives. A little bit of warning: both this video and the next one, which will deal with what is called matrix calculus, contain mildly advanced material.

Some of it you will already be familiar with in the one-dimensional context, or in the context of scalars, and we will be looking at the context of vectors also. We have only a few slides to go through, both in this video as well as in the next, but the material is a little bit dense, so please concentrate; if it is not entirely clear, it will still get clarified as the course goes on.

(Refer Slide Time: 1:05)

So let us first look at the idea of derivatives, which is essential for any sort of optimisation. Derivatives typically measure how one quantity changes when there is a small change in another. So if you have something like dY dX, it means how much does Y change given that X changes by a certain amount. As you would know, geometrically, in the simple scalar case we look at this as the slope of a tangent. If this is the curve Y = F of X, and this is the point, let us say p, then if you differentiate Y with respect to X at X equal to p you will get the slope of this tangent. Of course you can denote this as dY dX at X equal to p, or you can denote it as F prime of p, or F prime at X equal to p; there are multiple ways of denoting this, and you would be familiar with all of them from your prior experience.

So we know that this slope can be written as a limit involving a small perturbation of X: the limit, as H tends to 0, of F of X + H minus F of X, all divided by H. This of course is the limit of these secants as they go towards this point and become a tangent. So you take the points X and X + H, find out the difference in function values, and as H tends to 0 this ratio will tend to a finite value; that is what we call the derivative, or the slope, at that point. Now, when you have higher dimensions, by higher dimensions I simply mean you still have a scalar function but X now is a vector. In that case X vector could be something like, let us say, X 1, X 2, X 3, which is the case where X belongs to R 3; or, as in the figure that I will show shortly, X vector could be X 1, X 2, which means X belongs to R 2.

In such a case we can have partial derivatives, so let us look at an example: let us say Z = F of X and Y. Now if you want to visualize Z, you simply have the variables X and Y; as they change, Z changes, and you see here a whole surface for Z. Now I might want to know what Del Z Del X is; that is, at a particular point, let us say this point, I might want to know: if I just change X and keep Y fixed (you would have seen such things in thermodynamics, perhaps), how much does Z change?

The way to see that geometrically is, let us say, to draw a cross-section, something of this sort. Say Y is fixed at Y equal to, in this case, 1, and you can try to find out what this derivative is. A generalisation of this idea is with N variables: here F is a function that takes in a vector, in this case the vector A, which is in R N, which has N components, and it gives back a single scalar.

If you want to find out Del F Del X I, then, just as in this case where I wanted the derivative with respect to X, all you do is change only that one variable: I change only the Ith variable, I perturb it a little bit, so A I goes to A I + H, and then find out how much the function changes when I change just this variable; that limit as H tends to 0 is what is called the partial derivative of F with respect to the variable X I.

Reduced to a one-dimensional problem, this is what it would look like: this is simply the cross-section of the function at Y equal to 1, and if I want the slope now, all I do is change X by a little bit. So suppose I want Del Z Del X at X equal to 1, Y equal to 1; I take a cross-section where Y is fixed at 1 and evaluate the slope at X equal to 1 by changing only X, and that slope will actually give me the value of this partial derivative. So this is the idea of the partial derivative; again, you should be familiar with this from multivariable calculus.
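
As a small illustration of this limit definition, here is a hedged NumPy sketch; the function f below is an assumed example, not one from the slides:

    import numpy as np

    def f(x):
        # Assumed example: f(x) = x1^2 + 3*x1*x2
        return x[0]**2 + 3.0 * x[0] * x[1]

    def partial(f, a, i, h=1e-6):
        # Perturb only the i-th component: (f(a + h*e_i) - f(a)) / h
        a_h = a.copy()
        a_h[i] += h
        return (f(a_h) - f(a)) / h

    a = np.array([1.0, 2.0])
    print(partial(f, a, 0))  # ~ del f / del x1 = 2*x1 + 3*x2 = 8
    print(partial(f, a, 1))  # ~ del f / del x2 = 3*x1 = 3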

(Refer Slide Time: 6:20)

Now we can generalise this idea of derivatives to what is called the gradient, which we will be using very, very often. So let us say you have, let us call it, X, Y and F of X Y; you can call it Z = F of X Y, so you have a surface of this sort. There are several things to notice here. Suppose I want to say how much the value of the function changes at this point; that notion by itself is not meaningful unless you say with respect to what, for example how much it changes with X.

So you have Del F Del X and you also have Del F Del Y. In fact, instead of just looking at these 2 directions, where Del F Del X would be the change in the direction X and Del F Del Y would be the change in the direction Y, you could ask about a 3rd direction; I could call it Del F Del V, where V is some arbitrary direction. So if this is X and this is Y, V could be some 3rd direction altogether.

So the gradient is defined basically as a concatenation, a putting together, of all these partial derivatives. In the two-dimensional case we have two such partial derivatives; in the n dimensional case you will have n such partial derivatives, and you would basically write the gradient of F, in my case, as Del F Del X 1, Del F Del X 2. So in the n dimensional case it is Del F Del X 1, Del F Del X 2, so on and so forth up till Del F Del X n, and I have put a transpose there to show that this is a column vector; some people eliminate the transpose, some people put the transpose, either is fine. Notice that this is a vector; we will look at a more general case of this in the next video, which is the matrix calculus video, but this gradient is used very, very often.

Now what does the gradient physically represent? Here are a couple of figures to clarify the idea; let us look at the 1st figure. The 1st figure shows the shading. Imagine this surface here being collapsed; imagine it is a bunch of springs and you just collapse it, and you will see these projections here, which are called contours. What does a contour mean? If I take this contour and raise it up to the surface, then at all of these values of X and Y, Z has the same value. These are what are called level sets or contours, which we will also look at a little bit later in this video series.

This is shaded according to value: for example, here the value of the function is high, here the value of the function is low, so the place where the value is high is shaded dark black and elsewhere it is shaded white. Now, the gradient, notice, is a vector, and the direction of the gradient tells you in which direction the change is the sharpest; the change is the highest in the direction of the gradient. In this case, for example, the change is sharpest in the horizontal direction. This one, of course, is colour-coded: red means high, blue means low, but it is the same idea. Some of you who have worked in fluid mechanics, or even in other fields, might have seen this. Notice these arrows here: the arrows are aligned along the direction of maximum change.

Now if you have a more complex surface, something of this sort, once again you can draw the gradient field. Why is it a field? At any point I have an F, I have Del F Del X 1 and I have Del F Del X 2; these 2 put together define a vector, and that vector is what is drawn here; longer arrows mean higher gradients and shorter arrows mean lower gradients. Now, one useful way of utilising the gradient vector is, as I told you before, that you might not only want Del F Del X and Del F Del Y, you might also want Del F Del V, where V is some other direction. So suppose X and Y are orthogonal and V is a 3rd direction, and suppose you want Del F Del V; what does that mean? Physically it means: if I move in the direction V, or V hat, how much will the function change?

And this is fairly easy: all you do is take the gradient, which we defined before, Del F Del X 1, Del F Del X 2 and so on up to Del F Del X n, and dot this vector with the direction V. You can check special cases: if V were i cap, the X 1 direction, then the gradient in the direction V would be Del F Del X 1, which is correct, so this retains the meaning of the partial derivatives. Similarly, if you take direction 2 you will get Del F Del X 2, so on and so forth. For the coordinate axes this reduces trivially, but in the general case you simply take the dot product with that direction.
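
A hedged sketch of the directional derivative as a dot product, using the same assumed example f(x) = x1^2 + 3*x1*x2 as above:

    import numpy as np

    def grad_f(x):
        # Analytical gradient of the assumed f: [2*x1 + 3*x2, 3*x1]
        return np.array([2.0 * x[0] + 3.0 * x[1], 3.0 * x[0]])

    x = np.array([1.0, 2.0])
    v = np.array([1.0, 1.0])
    v_hat = v / np.linalg.norm(v)      # unit vector in the chosen direction

    print(grad_f(x) @ v_hat)           # directional derivative, (8 + 3)/sqrt(2)

    # Special case: v along the x1 axis recovers the partial derivative.
    print(grad_f(x) @ np.array([1.0, 0.0]))  # 8.0, i.e. del f / del x1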

(Refer Slide Time: 12:41)

Next we come to the idea of the Hessian; this is basically the gradient of the gradient. Remember, and we will see this once again in the next slide, the gradient is a vector; now you are trying to find out how much this vector changes as you move in space. Why would we use such a complicated quantity? Because it is the equivalent of the 2nd derivative in scalar calculus.

All the uses that we had for the 2nd derivative, like finding out whether something is a maximum or a minimum, also pass on to the Hessian, as we will see in some of the videos this week. So suppose F is a function; remember what this means: it is a box that takes in a vector as input, so X is a vector, and what it gives out is a scalar.

In such a case the Hessian is defined by the entries Del square F Del X I Del X J, so the Hessian is a matrix; every entry of the matrix is basically a second partial derivative. F is a scalar, so the first entry, for example, is Del square F Del X 1 square, the 1 comma 2 entry is Del square F Del X 1 Del X 2, so on and so forth; this is an N cross N matrix. You can also notice that this is a symmetric matrix: the 2 mixed derivatives are just the same, Del square F Del X 1 Del X 2 is the same as Del square F Del X 2 Del X 1. So the Hessian is a symmetric matrix, and from our linear algebra we know that, F being real, the Hessian has real eigenvalues and eigenvectors, so we will use this property a little bit later.
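
A hedged sketch for the same assumed example f(x) = x1^2 + 3*x1*x2; its Hessian is constant, and the symmetry and real eigenvalues can be checked directly:

    import numpy as np

    # H[i, j] = d^2 f / (dx_i dx_j) for the assumed f
    H = np.array([[2.0, 3.0],    # d2f/dx1dx1, d2f/dx1dx2
                  [3.0, 0.0]])   # d2f/dx2dx1, d2f/dx2dx2

    print(np.allclose(H, H.T))    # True: the Hessian is symmetric
    print(np.linalg.eigvalsh(H))  # real eigenvalues, as promised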

(Refer Slide Time: 14:43)

One other quantity that we would like to define is the Jacobian; the Jacobian is the equivalent of the gradient for vector valued functions. Remember, when we defined the simple gradient, the function took a vector input, but the value of the function itself was a scalar. Now you can define a more general case where you have a vector input and a vector output. The Hessian that we just looked at is very similar: you can see it as the Jacobian of the gradient, so the Hessian took in the gradient of F and gave out Del square F, where F is a scalar but grad of F is a vector. In general, we define the Jacobian entries as Del F I by Del X J; remember that since F is now a vector, it has an Ith component.

So in general the Jacobian is going to be M cross N: if the function takes in a vector of size N cross 1 and gives out a vector which is M cross 1, you can write the whole of the Jacobian as a single M cross N matrix. We will be using the Jacobian only very rarely, but we will show some general expressions in the next video.
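
A hedged sketch of a Jacobian for an assumed vector-valued example F(x) = (x1*x2, x1 + x2^2), which maps R 2 to R 2:

    import numpy as np

    def jacobian(x):
        # J[i, j] = dF_i / dx_j, written out analytically
        return np.array([[x[1],       x[0]],       # dF1/dx1, dF1/dx2
                         [1.0,  2.0 * x[1]]])      # dF2/dx1, dF2/dx2

    print(jacobian(np.array([1.0, 2.0])))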

(Refer Slide Time: 16:23)

The final idea for this video is that of the Taylor series. The Taylor series is extremely useful whenever you try to approximate functions; it is very widely used, one of the most commonly used ideas in science: practically anywhere that people use mathematics and calculus, somewhere or other the Taylor series will pop up, and this is of course true for machine learning and optimisation as well. There are many subtle things about the Taylor series that we are not going to look at; we are just going to look at a single slide for the Taylor series, and then we will be using it a little bit later, both in optimisation as well as in other parts of machine learning.

Remember that when you have a scalar, one-dimensional function of the kind that we use in school, for example F of X = e to the power X times Sin X or something of that sort, you can write the Taylor series as an expansion around some other point X 0: you want to approximate the value at some X given that you know the value at X 0, and you also know the derivatives at X 0, etc. That is the basic idea of the Taylor series. So F of X = F of X 0 + (X – X 0) times dF dX, this dF dX calculated at X equal to X 0, + half of (X – X 0) square times d square F dX square, also calculated at X 0, plus higher-order terms.

An example of this, which you might or might not have realised, is our familiar S = U T + half A T square; that is very similar to this: dF dX is like U, d square F dX square is like A, X – X 0 is the elapsed time, so the half (X – X 0) square is like the half A T square, and S is the total distance travelled. So that is a special case of the Taylor series, and you can easily extend it if you have more than just the acceleration: if only U and A exist then the expression for S would be what I told you, but if you have U, A and what is called the jerk, which is the 3rd derivative of the distance with respect to time, then you would add + 1 by 6 times (X – X 0) cube times d cube F dX cube, so on and so forth.

So the Taylor series should be familiar to you, but most probably you would not have seen it in the case of vectors. In case X, instead of being a scalar, is now a vector, you can still write the Taylor series; notice the similarities between these 2 expressions. F of X, where X is now a vector, is F of X 0, which is the same first term, + (and notice this is now a vector) X – X 0 transpose times G, where G is the gradient; just for compactness I have written it as G. So instead of dF dX you now have a full gradient, a full vector, and this term is effectively the dot product between one vector and another.

The next term, remember, is something we discussed earlier; it is called a quadratic form: we have (X – X 0) transpose H (X – X 0). In the scalar case it would be equivalent to (X – X 0) square, but in the vector case you cannot write it as (X – X 0) square; it is (X – X 0) transpose times H, where H is the Hessian we saw earlier, multiplied by (X – X 0), together with the factor of half as in the scalar case, + higher-order terms.

Luckily, practically nowhere, especially in the vector case, do people go beyond this; this is usually the maximum that we will go to. We will go up to the 1st order term, which is the gradient, and the second-order term, which is the Hessian, and this is sufficient for most practical purposes. As I said earlier, G is the gradient and H is the Hessian, both calculated at X 0.
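
A hedged numerical check of the second-order expansion, again with the assumed example f(x) = x1^2 + 3*x1*x2 from earlier; since that f is exactly quadratic, the second-order Taylor series reproduces it exactly:

    import numpy as np

    def f(x):
        return x[0]**2 + 3.0 * x[0] * x[1]

    x0 = np.array([1.0, 2.0])
    g = np.array([2.0 * x0[0] + 3.0 * x0[1], 3.0 * x0[0]])  # gradient at x0
    H = np.array([[2.0, 3.0],
                  [3.0, 0.0]])                              # Hessian at x0

    x = np.array([1.1, 2.2])
    d = x - x0
    taylor = f(x0) + d @ g + 0.5 * d @ H @ d

    print(f(x), taylor)  # both 8.47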

So these are just some preliminaries of multivariable calculus; we will be using these ideas only sparingly, but you do need them for building your intuition, so in case something is not clear, please revisit this video a few times. In the next video we will be looking at a few simple mathematical relations in matrix calculus; like this one, that is slightly advanced material.

Machine Learning for Engineering and Science Applications
Professor Dr. Balaji Srinivasan
Department of Mechanical Engineering
Indian Institute of Technology, Madras
Matrix Calculus (Slightly Advanced)

(Refer Slide Time: 0:15)

In this video we will be looking at matrix calculus. This is a short video and the portions are slightly advanced; once again, as with some portions of the probability series, you are free to skip this. I would still recommend that you go through it and see some of the relations; if you are not able to understand them or fully exploit them, that is fine, because we will not be using this for most of the course: about 90 percent of the course can be done even without understanding this very well.

(Refer Slide Time: 0:46)

So here is the motivation for why we are looking at this. Please remember that, as we said in the first couple of weeks, machine learning basically requires you to take some input vector and change it into some output vector. Now, what will often happen during training is that the input vector you have given does not quite produce the output vector you require.

For example, if your output changes with respect to some set of parameters that you have, you would like to know how much the output will change if you turn a few knobs, that is, change a few parameters. In such cases you basically need to know how one vector changes with respect to another vector or some other parameter. The standard way we measure how one quantity changes with respect to another is, of course, the partial derivative; if you have two scalars, you know very well how to find out how one function changes with respect to a parameter x. All of this is, of course, a subset of calculus.

Now we are going to slightly extend this idea into matrix calculus, which basically asks: how does one vector change with respect to another vector, how do you parameterize this, and what are some of the basic relations? This is going to be a very initial, preliminary class; I am going to present only some of the relations that we will require for machine learning. Of course, much more advanced material exists. As I said in the introduction, it is useful to understand these, but in case you do want to go ahead with the course even without understanding this material, that is fine; you will be able to extract 90 percent of the information of this course anyway.

Of course, in comparison to the relations that I am showing, many more advanced relations do exist. A good source is the one I have flashed on the screen right now, and there are sources within this website too that go into greater detail.

(Refer Slide Time: 3:00)

Let us look at one simple case. We will assume, of course, that you know how to differentiate one scalar with respect to another; but let us say you have a scalar differentiated with respect to a vector, or a vector with respect to a scalar. In this case, this is del a vector with respect to x; remember, a is a vector and x is a scalar. If you differentiate a vector with respect to a scalar, the result that you get is a vector.

For example, let us say the vector a is (x square, x cube, x power 5) and x is a scalar; then del a vector del x also has three components, which are going to be 2x, 3x square and 5x to the power 4, which is what is represented here: the ith component of this vector is simply del a i del x. I hope this portion is clear. You could have the reverse case, where you have a scalar differentiated with respect to a vector, whose ith component is del f del x i; an example would be something like the following.

This is a scalar function; it takes three inputs x, y, z, which we could call the x vector, or if you prefer, we can write the function as f = x 1 times x 2 times x 3 square. Then del f with respect to del x vector, which is of course what we call the gradient, is a vector: the first component is del f del x 1, the second component is del f del x 2, the third component is del f del x 3. I will put a transpose here because as written it is a row vector, and you can turn it into a column vector.

So del f del x 1 is x 2 x 3 square, del f del x 2 is x 1 x 3 square, and del f del x 3 is 2 x 1 x 2 x 3. This is differentiation of a scalar with respect to a vector; the gradient is the prototypical example of such a thing, and it also results in a vector.
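
A hedged sketch of the two cases on this slide, evaluated at assumed sample points:

    import numpy as np

    # Vector w.r.t. scalar: a(x) = [x^2, x^3, x^5] gives da/dx = [2x, 3x^2, 5x^4]
    x = 2.0
    print(np.array([2 * x, 3 * x**2, 5 * x**4]))

    # Scalar w.r.t. vector: f(x) = x1 * x2 * x3^2 gives
    # grad f = [x2*x3^2, x1*x3^2, 2*x1*x2*x3]
    xv = np.array([1.0, 2.0, 3.0])
    print(np.array([xv[1] * xv[2]**2,
                    xv[0] * xv[2]**2,
                    2.0 * xv[0] * xv[1] * xv[2]]))  # [18, 9, 12]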

(Refer Slide Time: 6:06)

Now let us look at a slightly more involved case: the differentiation of one vector with respect to another vector. Where does this occur physically? Of course, in machine learning we might not necessarily look at physical examples, but just for physical intuition, let us say you have the velocity of the air in a room and you want to differentiate it with respect to position. At each point you have an x velocity, y velocity, z velocity changing with respect to the location x, y, z, so as I move to a different point, all three components will change.

So this is a vector differentiated with respect to another vector, and what it results in is a matrix, or what we call a second order tensor. It is actually a fairly simple relationship: the ith component of a differentiated with respect to the jth component of b. For example, if a is the vector (a 1, a 2) and b is the vector (b 1, b 2), then del a vector by del b vector has the entries del a 1 del b 1, del a 1 del b 2, del a 2 del b 1 and del a 2 del b 2; so this is a matrix, and that is what is written here: (del a del b) ij is equal to del a i del b j.

This next relation we will actually be using a little bit more. This is the differentiation of a dot product with respect to x; strictly speaking, x is a vector and a is a vector, so the product is of course going to be a scalar. This is a special case of the previous example of a scalar differentiated with respect to a vector, but unlike the previous case, x itself actually occurs in the expression.

Remember, x dot a can be written as x transpose a: x is a vector, and when you transpose it you get a row matrix multiplied by a column matrix. It can also be written as a transpose x; both are the same because the product is actually a scalar. And you can show that del del x of (x dot a) is equal to the vector a. I will just quickly show this to you in a special case; you can also show it in general.

Let us say x vector is (x 1, x 2, x 3) and a vector is (a 1, a 2, a 3); this means x dot a, as you know, is a 1 x 1 plus a 2 x 2 plus a 3 x 3. Remember, this is a sum, so you have a scalar. Now del del x vector of (x dot a) is going to be del del x 1 of (x dot a), del del x 2 of (x dot a) and del del x 3 of (x dot a), as we saw in the previous slide. You can immediately see that del del x 1 of this is a 1, del del x 2 of this is a 2, del del x 3 of this is a 3; so the result is the vector a. Hence you can prove this; I have shown it in the 3 dimensional case, and you can of course show it in n dimensions quite easily.
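
A hedged finite-difference check that del del x of (x dot a) is the vector a:

    import numpy as np

    a = np.array([1.0, -2.0, 3.0])
    x = np.array([0.5, 0.5, 0.5])
    h = 1e-6

    grad = np.zeros(3)
    for i in range(3):
        xh = x.copy()
        xh[i] += h                      # perturb one component at a time
        grad[i] = (xh @ a - x @ a) / h

    print(grad)  # approximately [1, -2, 3], i.e. the vector a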

(Refer Slide Time: 10:05)

Now let us come to the next level, which is differentiating matrices with respect to vectors or scalars. Suppose you have a product of two matrices, AB, and you are differentiating it with respect to x; x could be a vector or a scalar, and in either case the relationship shown here holds: del A del x times B, plus A times del B del x. This is the equivalent of the product rule, and remember that the order of A and B never changes. So it is del A del x times B plus A times del B del x; do not write this as del A del x times B plus del B del x times A. That does not work, because the matrix product in general does not commute, so in this product rule you have to be careful about the order of the matrices being used.

And the result I am showing now is important, along with the dot product rule that I showed just before. Remember, let us say A is an n cross n matrix; then x transpose A x is actually a scalar. x transpose A x is called a quadratic form; you can review the linear algebra lectures where this was discussed. So x transpose A x, all put together, is a scalar. Why? Because x transpose is 1 cross n, A is n cross n and x is n cross 1, so if you multiply all of them you will get 1 cross 1. You will see natural places where this tends to occur.

Now, what is the derivative of this with respect to x? Remember, the derivative of a scalar with respect to a vector has to be a vector, and indeed it is: the result is (A plus A transpose) times x, an n cross n matrix times an n cross 1 vector, so the result is an n cross 1 vector. We are not going to do the full proof; the proof is actually slightly involved, but I will show you a quick verification, just as with the dot product case, on the next slide.

A useful special case is when the matrix is symmetric. Symmetric A means A is equal to A transpose, so the relation simply becomes 2Ax, and it looks remarkably like our scalar formula, d by dx of alpha x square equals 2 alpha x. Of course that is for scalar x; here you have to be a little bit careful: the quantity is x transpose A x, and you get 2 times A x for symmetric A.

(Refer Slide Time: 13:11)

Let us look at the quadratic form and do a quick verification for the 2 by 2 case. Assume A is of the form A 11, A 12, A 21, A 22; of course x transpose will be the row (x 1, x 2) and x can be written as the column (x 1, x 2). Recall that x transpose A x is essentially the summation over i and j of all products of the form A ij x i x j; you can of course multiply it out using the usual matrix multiplication and find this, and if you do you will get A 11 x 1 square plus A 12 x 1 x 2 plus A 21 x 2 x 1 plus A 22 x 2 square.

Now if I have to find del del x vector of this, let us call the scalar Q; as before, the gradient is (del Q del x 1, del Q del x 2). We know del Q del x 1 is equal to 2 A 11 x 1 plus x 2 times (A 12 plus A 21). Similarly, for del Q del x 2 the first term does not contribute anything, and you get x 1 times (A 12 plus A 21) plus 2 A 22 x 2. Now, please notice, this can be written as the matrix with entries 2 A 11, (A 12 plus A 21), (A 12 plus A 21) again, and 2 A 22, times the vector (x 1, x 2). This matrix of course is A plus A transpose, which you can quickly see: if you take A transpose, its off-diagonal entries are A 21 and A 12, and if you add the two matrices you get exactly this.

So this is (A plus A transpose) times x, and you have verified the relationship. We will be using these relationships off and on during the rest of the course, thank you.
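
A hedged numerical check of the quadratic-form rule, del del x of (x transpose A x) = (A plus A transpose) x, for a random non-symmetric A:

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((3, 3))
    x = rng.standard_normal(3)
    h = 1e-6

    Q = lambda x: x @ A @ x
    grad_fd = np.array([(Q(x + h * e) - Q(x)) / h for e in np.eye(3)])

    print(grad_fd)        # finite-difference gradient
    print((A + A.T) @ x)  # analytical result; the two should match closely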

Machine Learning for Engineering and Science Applications
Professor Dr. Balaji Srinivasan
Department of Mechanical Engineering
Indian Institute of Technology, Madras
Optimization – 1 Unconstrained Optimization

(Refer Slide Time: 0:15)

In this video we will be looking at the beginnings of optimization, specifically unconstrained optimization.

(Refer Slide Time: 0:21)

The relevance of optimization to machine learning is very, very high. As we saw in the first week, the basic idea behind most of machine learning is that you want to build data models: models that take some input and map it to output data. Now, usually our maps depend on certain parameters, and the way we improve our models, as you will see next week, is based on something called training: you give more and more data and try to improve your parameters.

Usually you would like to know how much your output changes depending on the parameters, so you have some quantity, a vector quantity, that changes based on some other vector quantity. Most of our machine learning is dependent on finding the best or optimal model for some given set of data, so most machine learning problems can usually be rewritten as optimization problems.

What we will be doing in the next series of videos is to introduce as well as review some optimization techniques; some of you will be familiar with some of these ideas already.

(Refer Slide Time: 1:38)

Typically, what you try to do in a general optimization task is to maximize or minimize a function; once again, as in the previous videos, this function can be something that takes in a vector and gives out a scalar. The function you are trying to maximize or minimize is called an objective function, or a cost function, or a loss function; this terminology is used interchangeably.

The function in a general optimization task can either be a scalar, in which case it is called a single objective optimization problem, or f itself can be a vector, in which case it is a multi objective problem. In this course we are going to restrict ourselves to the single objective optimization problem; even that is an involved problem.

So we will only be dealing with that, and this is actually true of most practical machine learning anyway: we try to define a cost function or objective function which is itself a scalar. Remember, x in general is a vector, and typically we are going to deal with the case where f goes from R n to R. An example of such an f could be f of a 3-dimensional x vector: x 1 square plus x 2 square plus x 3 square; here f goes from R 3 to R.

Now, even though the general optimization task is to either maximize or minimize a function, we will typically talk only about minimization, because all optimization problems can be recast as minimization problems. Why? Because if it is a maximization problem, you simply minimize minus f of x. So in the next few slides, as well as in the next video, I will only talk about minimization, because maximization is a trivial change obtained by simply flipping the sign.

Here is some notation: the optimal, or minimal, solution we will write as x star; the star denotes optimal. Notice the term arg min: min of f of x would simply mean the minimal value of f, while arg min of f of x is that x which results in the minimum of f. Just to give you an example, if f of x is, let us say, x square plus 1, then the minimum of f is 1, but the arg min of f is the value of x that gave you f equal to 1, which is x equal to 0. We will be using this notation quite often: arg min is that argument, that value of x, which gives us the minimum of f of x.
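
A hedged sketch of min versus arg min on a grid, for the same example f(x) = x square plus 1:

    import numpy as np

    x = np.linspace(-2.0, 2.0, 401)
    f = x**2 + 1

    print(f.min())          # min f = 1.0
    print(x[np.argmin(f)])  # arg min f = 0.0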

(Refer Slide Time: 05:02)

Here is a quick review of scalar optimization. As you remember, if you have some function f of x plotted versus x, it is in general going to be a curve, and you are going to have various minima. For now we are going to look at the unconstrained problem; unconstrained means there are no constraints on x, no limits on x: we are looking at x belonging to the whole of the real line. We will look at the constrained case in the next video; in this video we are only looking at unconstrained problems, so we will assume x can go from minus infinity to plus infinity.

In such a case you could have a global minimum and a global maximum, and you could also have a local minimum and a local maximum; that is, locally, if I just put a box here, all the values around the local minimum are greater than that local minimum, but it might not be the global minimum.

Now it can be shown, though we are not going to show it, that all these extrema, whether local minima or local maxima, have the property that f prime of x equals 0 in the unconstrained case. These points are called stationary points or critical points. A stationary point, as I have just shown, could be a local minimum or a local maximum or something called a saddle point.

How do we figure out whether it is a local minimum or a local maximum? Typically you look at the second and higher derivatives; we will look at just the second derivative here. If f double prime of x, the second derivative, is positive, for example here, then it is a local minimum. If you look at the slope here, you are looking at the slope of the slope: as I move away from this point the slope increases, which means there is a minimum here.

Here the slope is 0, and here the slope is positive; that is why a positive second derivative indicates a local minimum. As an optional exercise, those who are interested can try to prove this using the Taylor series. Similarly, if f double prime of x is less than 0, then it is a local maximum; the same idea again. Now it can happen that f double prime of x is exactly 0; in such a case it could be a saddle point. For example, if you look at f of x equal to x cube around x equal to 0, this is precisely what happens.

What is happening here? This is like the shape of a horse's saddle, as we will see in multiple dimensions also. In this direction there is an increase, in this direction there is a decrease; it is sort of a combination of this curve and this curve, so from one side it looks like a local minimum and from the other side it looks like a local maximum. This happens when f double prime of x is also equal to 0, and in such a case it could be a saddle point. All of you are familiar with the notion of global maximum and minimum: this is the absolute maximum or absolute minimum that you get over all of space.

(Refer Slide Time: 8:40)

Now let us look at the multivariate case. Here you are trying to find the x that minimizes f of x, but x now belongs to R n instead of simply belonging to R; once again we are looking at the unconstrained problem, with no constraints on x. As we saw in the derivatives and gradient slides, since x is now a vector quantity, instead of simply df dx you now have to evaluate the gradient of f.

In analogy to what we saw earlier, at any local extremum, for example here, the gradient will be 0; remember, this is the 0 vector, which means del f del x 1 will be 0, del f del x 2 will be 0, and so on, up to del f del x n if x is an n dimensional vector. Once again these are called stationary points or critical points, and as in the 1 dimensional case you could have a local minimum, a local maximum or a saddle point.

Some examples are given here: this is a local as well as a global maximum; here, for example, is a local minimum which is not a global minimum, because there are lower values further on; and this is an example of a classic saddle point: in one direction it is a local maximum and in another direction it is a local minimum. That is what a typical saddle point looks like.

Now, how you find out whether a point is a local minimum, a local maximum or a saddle point depends on the Hessian rather than the simple second derivative; remember, for vectors the generalization of the second derivative is the Hessian. As we saw in the previous slides, the Hessian is a matrix, and unlike before, I cannot simply say "the Hessian is positive"; that has no meaning, because it is a full matrix.

So, what happens when the Hessian is positive definite? You might remember this from the linear algebra slides: positive definite means all eigenvalues of H are positive. This is not merely positive semi-definite; all eigenvalues of H must be strictly positive. And remember, since the Hessian is a symmetric matrix, we are guaranteed to have real eigenvalues, so that you can talk about this meaningfully.

If the Hessian is positive definite, then it is a local minimum; if the Hessian is negative definite, which would mean all eigenvalues are less than 0, then it is a local maximum; and if the matrix is indefinite, meaning it is neither positive definite nor negative definite, so some eigenvalues are positive and some are negative (or possibly zero), then it is a saddle point, thank you.
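
A hedged sketch of this classification; f(w) = w 1 square minus w 2 square has a stationary point at the origin, and its (constant) Hessian is indefinite:

    import numpy as np

    H = np.array([[2.0,  0.0],
                  [0.0, -2.0]])        # Hessian of w1^2 - w2^2

    eig = np.linalg.eigvalsh(H)        # symmetric, so real eigenvalues
    if np.all(eig > 0):
        print("local minimum")
    elif np.all(eig < 0):
        print("local maximum")
    else:
        print("saddle point")          # printed here: eigenvalues are -2 and 2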

Machine Learning for Engineering and Science Applications
Professor Dr. Balaji Srinivasan
Department of Mechanical Engineering
Indian Institute of Technology, Madras
Introduction to Constrained Optimization

(Refer Slide Time: 0:15)

In this video we will be looking at an introduction to constrained optimization. This is a very, very brief introduction, because the topic is vast by itself. We looked at unconstrained optimization in the previous video; this is just to introduce you very quickly to constrained optimization. We will not be doing any proofs; this is just to give you the overall idea of how constrained optimization is set up and how it works. We will look at proofs later, in week 9 or 10 of this course.

(Refer Slide Time: 0:46)

Unlike unconstrained optimization, where you find the minimum of f of x over any x, in the constrained optimization task you minimize f of x subject to certain constraints on x. For example, you could say: find the minimum of a function of x 1, x 2, x 3, like this one, where the norm of x is greater than 1. We could also give engineering examples: you want to design a vehicle that goes really fast, but you also want to put a constraint on fuel efficiency, since you do not have unlimited fuel.

You could also say something like: find the best weight for this chair, where at the very least you have the constraint that the weight of the chair has to be positive, because mathematically the solution might not automatically satisfy that. Such cases are very normal and in fact very, very common; constrained optimization is probably the most naturally occurring optimization problem.

There can be two types of constraints. One set of constraints are equality constraints: for example, minimize f of x 1, x 2, x 3, some function, given that x 1 plus x 2 plus x 3 equals 1. For instance, you could ask: given that I have fixed the length of a wire, what is the figure of maximum or minimum area that I can make with it?

Similarly, there are inequality constraints: instead of requiring x 1 plus x 2 plus x 3 equal to 1, you say x 1 plus x 2 plus x 3 has to be less than 1. "Find my fastest vehicle given that fuel consumption has to be less than a given value" would be an example of constrained optimization with an inequality constraint.

Now we come to something called the canonical form. Just as all optimization problems can be written as minimization problems, it turns out all constrained optimization problems can be written in a particular way. The overall expression is very simple: minimize f of x (remember, I can always maximize by putting minus f) subject to the constraint that x belongs to a given set S.

The expression for S will look like this; at first sight it will look quite complicated, but it is actually fairly straightforward. Instead of giving one constraint, I give multiple constraints: let us say I have a whole bunch of equality constraints and a whole bunch of inequality constraints. The x in this set S, those which satisfy the constraints, are called feasible points.

For example, let us say somebody comes with a design for a car, and you had set a fuel consumption requirement saying it should give me at least 10 miles per litre, but they give you a design which gives 3 miles per litre; then you will say this is outside the set, this is not a feasible design. The constraints are what decide whether your final design, your final optimum, is feasible or infeasible.

So an x that satisfies the constraints is called a feasible point. We have written this big expression, but the pieces are actually fairly simple. Instead of having one equality constraint, let us say you have multiple equality constraints: you are designing a room, and you could say the length has to be this much and the breadth has to be this much, but you are free to decide on the height. In that case your length and breadth would be equality constraints.

You could also have a bunch of inequality constraints. The equality constraints are simply written as g i of x equals 0. For example, the earlier constraint can be rewritten as x 1 plus x 2 plus x 3 minus 1 equal to 0; any equation, by bringing the constants to one side, can be written as an equality constraint, and now this is your new g of x: g of x is x 1 plus x 2 plus x 3 minus 1.

Similarly, all inequality constraints can be written as something less than or equal to 0: the constraint above can be written as x 1 plus x 2 plus x 3 minus 1 less than 0, and now your h of x would be x 1 plus x 2 plus x 3 minus 1. So the feasible set is usually a combination of equality constraints and inequality constraints.
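
As a hedged illustration of the canonical form (not from the lecture), here is how such a problem could be posed numerically with SciPy; the objective and constraints below are assumptions, and note that SciPy expects inequality constraints as fun(x) >= 0, so an h(x) <= 0 constraint is passed with its sign flipped:

    import numpy as np
    from scipy.optimize import minimize

    f = lambda x: x[0]**2 + x[1]**2 + x[2]**2  # assumed objective
    g = lambda x: x[0] + x[1] + x[2] - 1.0     # equality constraint g(x) = 0
    h = lambda x: x[0] - 0.5                   # inequality constraint h(x) <= 0

    res = minimize(f, x0=np.zeros(3),
                   constraints=[{'type': 'eq',   'fun': g},
                                {'type': 'ineq', 'fun': lambda x: -h(x)}])
    print(res.x)  # close to [1/3, 1/3, 1/3], a feasible point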

(Refer Slide Time: 5:36)

Now, how do we solve this problem? The expression I am going to write now will look a little bit complicated, even if you are familiar with what are called Lagrangian functions, but it finally turns out to be a fairly simple function; we will see the details in week 9 or 10. Remember that our constrained optimization problem wants us to minimize the function f of x while ensuring that the point discovered belongs to the feasible set; this is in general a difficult problem.

A very common approach for this is what is called the generalized Lagrangian. What is a generalized Lagrangian? It employs a standard trick used in a lot of mathematics: suppose you are facing a problem which is difficult to solve; you can actually simplify it by adding extra variables. This is sort of like in geometry, where you make some extra constructions in order to complete proofs; similarly, you add a few things which were not originally there.

So we add a couple of things: two new variables called lambda and alpha, both vectors. This was our original function, and you create a new function called the Lagrangian L: the original function, plus your equality functions multiplied by some arbitrary constants, plus your inequality functions multiplied by other arbitrary constants. You just sum this up and make a giant new function L, which is f plus lambda times g plus alpha times h.

Then, once again, the expression looks complicated but it is actually fairly straightforward, as we will see later in week 8 or 9; till then you will not be using constrained optimization. It turns out that the minimum of f is exactly the minimum over x of the maximum over lambda and alpha of L; we will see the details of this later, when we do what are called support vector machines.

For now, all we would like you to recognize is that there is something called constrained optimization, which is different from unconstrained optimization, thank you.

Machine Learning for Engineering and Science Applications
Professor Dr. Balaji Srinivasan
Department of Mechanical Engineering
Indian Institute of Technology, Madras
Introduction to Numerical Optimization Gradient Descent - 1

(Refer Slide Time: 0:15)

In this video we will be looking at an introduction to numerical optimization. So far, the optimization we have been looking at was essentially theoretical: we were just looking at analytical expressions. In this video we will see an introduction to how we can do the same thing numerically, and specifically we will be looking at an algorithm called gradient descent, which is sort of the workhorse for most of deep learning.

(Refer Slide Time: 0:40)

Why is it that we need numerical optimization? We were looking at shapes of this sort: so far we considered a case where you have some f of x, and let us say x is a vector with two components x 1 and x 2. In that case, if you knew f of x as an analytical function of x 1 and x 2, then you could use various ideas such as setting the gradient of f equal to 0, and you have standard methodologies to find out what the appropriate minimum or maximum is.

However, in most cases what happens is that we do not have explicit expressions; you do not really know what f is. An explicit expression would be something of the sort J of w = w 1 square plus w 2 square plus w 3 square plus 4. A small note: starting from this video I will talk of optimization in terms of J and w, because that is the notation we will ultimately use when we go to deep learning.

Usually we know the function only as a black box: some w comes in, or some x comes in, and some f comes out. Similarly, some w comes in and some J comes out; the analytical expression is unknown. So this is a proper black box; we will see a couple of sub-cases of this later on in this video, but generally what happens is, let us say x 1 is 1, x 2 is 2, and the box simply tells you that J, or f, is 5.

Similarly, any time you give it x 1 and x 2, it is able to give you a J or an f, but you still want to optimize it. In such a case, the methods that we used so far are not really usable. In the case of deep learning, this black box is typically a neural network or something of that sort. So we want to find something that can deal directly with numbers rather than with analytical expressions, and that is why you need numerical optimization as against analytical optimization.

(Refer Slide Time: 2:55)

Here is the simple idea: what you want to do is drive the gradient of the function that you are trying to minimize or maximize to 0. Remember, we are using the notation where the function is J and the variable we are optimizing over is w; so we want grad J of w to go to 0, specifically to the 0 vector, because remember the gradient is a vector. But we do not have an analytical expression for J.

The iterative process is as follows. You take a guess; this is always the iterative process in anything, not just optimization: whichever variable you are trying to find, whether in a linear system of equations or whatever system you are trying to solve or optimize, you do not know the optimal w, so you take a guess. Here the superscript k refers to the iteration number; we will see a few examples later in the slide.

You take the guess and run it through the black box; this gives you a value of J, which might or might not be the optimal value (if you are a really good guesser you will automatically get the right value, but generally you will not), and then you find the gradient of J at w. Now a question might arise: if you only have the function as a black box, how are you going to find the gradient? We will discuss this in a subsequent video, but assume for now that you have a method of finding not only J but also the gradient of J as a number.

Now, suppose this gradient turns out to be 0: you stop. If not, you take a new guess. How do you take a new guess? Do you guess randomly? No; it turns out that there are specific methods for finding improved guesses: based on the w you had and the grad J you got, you can actually get a better guess. This method of improving your guess is what is called gradient descent.

(Refer Slide Time: 5:02)

Let us take the example of gradient descent in a simple scalar case: J of w, where w is now a scalar, a single number. Let us say it looks like this; we may not know what the specific function is, we just need to know that we are trying to get here. Now let us say this is the actual optimal w, but your guess is this; let me call it w 0, this is our guess.

When you take this guess, the J you get corresponds to this guess; so this is J, this is w, and we can immediately see that this is not the optimum. Why not? Because at this point dJ dw is not equal to 0. Now, if you treat this as a game, from this w you have two choices: you can either move to the left or move to the right in order to improve your guess.

Looking at this picture, we immediately know that we have to move to the left. How do we know this? Find the slope at this point: dJ dw here is actually positive. If it is positive, we know that the minimum lies somewhere to the left, so you say: new w is w minus something, and this something is often written as some alpha multiplied by the slope, so that if you are here and the slope is negative you will instead go to the right. This is the simple idea behind gradient descent.

Our task is basically to improve our guess for w. For a scalar this is fairly straightforward: the new guess is the old guess minus alpha times dJ dw, where alpha is an arbitrary positive parameter. This parameter is often called the learning rate; we will see it again shortly.
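
A hedged sketch of this update rule for an assumed example J(w) = (w - 3) square, whose minimum is at w = 3; the initial guess and learning rate are also assumptions:

    # Scalar gradient descent: w_new = w_old - alpha * dJ/dw
    dJ_dw = lambda w: 2.0 * (w - 3.0)

    w = 0.0        # initial guess w^0
    alpha = 0.1    # learning rate

    for k in range(50):
        w = w - alpha * dJ_dw(w)

    print(w)  # close to 3.0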

(Refer Slide Time: 7:11)

Now let us take the more complicated vector case. You have a whole surface; let us say once again you have J as a function of w 1 and w 2, and once again you guess. We can see that the actual minimum is here. What is drawn at the bottom are contours; recollect from our discussion of multivariable calculus that the contours are basically the surface collapsed down: if you think of it as a series of rings, each of constant value, and you collapse all of them, this is the kind of contour plot you will get. These contours are what are called level sets, basically lines of equal value.

You would also see this on something like the weather channel: lines of constant pressure or lines of constant temperature moving around; they are a representation of the function. So suppose here just the contour is drawn, so on one line the value of J is constant. We want to come to this point, which is actually the minimum, but instead let us say I guess somewhere here: this is my w guess, my first guess.

Now I want to move in the right direction. Remember that unlike in 1D, movement is now in two directions, okay. So the delta w, the change in w that you have to give, is actually a full vector: you have to say how much you want to move in the x direction and how much in the y direction; you have to give both. Luckily for us we have a nice theorem which tells us the direction in which the function decreases the fastest.

So for example if you are here, you would like to move in the direction where the decrease is the sharpest; you can think of this as a ball rolling downhill, and you want to get to the bottom as fast as possible. So you would like to move in the direction where the change in J is the greatest, the steepest, and it turns out that the gradient gives exactly the direction we are looking for, okay.

We will show a quick proof of this, not at the end of this video but in the next one. The general gradient descent algorithm, it turns out, is a very simple generalization of the scalar case: the new w (remember this is a vector) is the old w minus alpha times grad J, which is also evaluated at the old w, okay. So you take the steepest descent direction, multiply it by a parameter just to adjust the size of the step, and then move from there. This will become a little clearer as we go through a couple of examples. Alpha is a very important parameter, called, as I said earlier, the learning rate; this is something that we have to choose.

(Refer Slide Time: 10:30)

So let us take an example, a very simple function for which we already know the minimum: the function J = w_1^2 + w_2^2 + 4. We know that the bottom is here; the actual minimum is at (0, 0). Here are the contours drawn here; these are circles, because J is constant when w_1^2 + w_2^2 is constant, which means these are circles centred at the origin, okay.

So let us take the gradient of this; the analytical gradient is simply the vector (2w_1, 2w_2). The iterative formula we get is, remember, new w vector equals old w vector minus alpha times the grad J vector. This grad J has two components, which means w_1 becomes w_1 minus alpha times the first component, which is 2w_1, as I have written before. I have denoted these k+1 and k instead of old and new; old I am calling k, new is called k+1, so that we can keep on iterating from the first guess to the second guess and so forth, okay.

Similarly, the second component works the same way: 2w_2^k comes from the second component of the gradient, so this is the iterative formula. We know that the actual minimum is at (0, 0), as we said. Let us start with some random guess; suppose we give a bad guess of (3, 4), which on this curve is somewhere here. This is my first guess; I want to go here, to the ideal w*.

Now we proceed by choosing some value of this constant alpha. We will take 4 different choices, just so that you can see a range of behaviours: we will start with alpha equal to 2, and then look at 1, 0.1, and 0.5.
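Before walking through the tables, here is a small Python sketch that reproduces this experiment; the two-iteration horizon is an arbitrary choice, just enough to expose the behaviours discussed below.

import numpy as np

def J(w):
    # the example cost J = w1^2 + w2^2 + 4
    return w[0]**2 + w[1]**2 + 4.0

def grad_J(w):
    # analytical gradient (2*w1, 2*w2)
    return 2.0 * w

for alpha in [2.0, 1.0, 0.1, 0.5]:
    w = np.array([3.0, 4.0])          # the bad initial guess (3, 4)
    for k in range(2):
        w = w - alpha * grad_J(w)     # gradient descent update
    print(alpha, w, J(w))             # diverges, oscillates, creeps, or jumps to (0, 0)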

(Refer Slide Time: 12:48)

Okay, so let us start with alpha equal to 2, starting at (3, 4). Let us look at a simple table. At iteration 0, our initial guess, w was (3, 4). I am not writing the transpose each time; please understand that I am treating it as a row vector instead of a column vector, but it works the same way. Now grad J is simply (2w_1, 2w_2), which is (6, 8), and J, the cost, is 3 squared plus 4 squared plus 4, which comes to 29.

And now we can calculate w^{k+1}, which is (3, 4) minus alpha, which we chose to be 2, multiplied by (6, 8); if you calculate it, it comes to (-9, -12). So it has gone far away, okay: we started here and we have gone somewhere outside of the picture. You can see that this is actually not doing well; I would like to come here, but I have gone far away somewhere else. Let us see how it goes further.

So I start from (-9, -12) and proceed again using the same formula: grad J is (2w_1, 2w_2), which is (-18, -24). If you calculate J, it has actually increased. Ideally we would like J to always decrease; this does not always happen in gradient descent, and you can see that here it has increased tremendously, okay. If you calculate w^{k+1} now, it has come to (27, 36), so from here you have gone somewhere else entirely.

If you see in this picture, you started somewhere here, at (3, 4), then went somewhere far out, then somewhere else, so we are actually going further and further away. If I put in (27, 36), I see that my J has increased from 229 to 2029 and w has become worse. In this kind of process we see that J is not coming down but is going further and further away from the actual solution; even if you do not know the actual solution, you can at least see that J, our cost function, is increasing, and we are going to worse and worse places rather than better and better ones. So this is a divergent case of alpha.

(Refer Slide Time: 15:35)

So, unhappy with this, we try a slightly lower alpha; it is always a good prescription to try a lower alpha in case a higher alpha does not work. Once again we start here, but instead of alpha equal to 2, if we use alpha equal to 1 we get (-3, -4), which at least seems a little better. So you started at (3, 4) and came to (-3, -4); you wanted to come here, and maybe, hopefully, we will get there.

So we now put in (-3, -4); the corresponding grad J is (-6, -8). J unfortunately has not decreased, because it depends on w_1^2 + w_2^2, and if you find what w^{k+1} is, which is (-3, -4) minus 1 times (-6, -8), you get (3, 4), so now you are back here. We are sort of stuck in a cycle: it oscillates between the two points (3, 4) and (-3, -4), back and forth, and J does not decrease at all in this case.

So this case is also not useful for us, because we would like to come systematically towards the actual minimum; this is an example which does not converge at all.

(Refer Slide Time: 16:58)

Let us take a third case with a much smaller alpha, 0.1, okay. If we go through the exercise now, all that has changed from the previous two examples is the alpha, which has become 0.1, and you see that things are slightly better: you have come to (2.4, 3.2), somewhere here. So we have got a little better; at least it looks promising, and we can now check what happens in further calculations.

You will also notice that slowly, instead of either increasing or getting stuck at the same point, J is actually decreasing. I would recommend that you do this exercise yourself; you will also see one such example problem in the assignments. You can see now that from (1.92, 2.56) it has come to about (1.54, 2.05), which is a little better once again, okay, getting somewhere here; so slowly we are approaching the origin.

Now keep repeating the exercise. This is now the 30th iteration, not just the second; we wrote a code, and if you look at the 30th iteration you will see that w is actually getting quite close to 0 and the cost has come very close to the minimal cost. Why is 4 the minimal cost? Because if w_1 and w_2 were 0, the actual cost would be 4. So you are converging slowly, and we have come somewhere here by the 30th iteration.

Now a couple of things are worth noting here. One is that we have not actually come to the true minimum (0, 0), and in fact if you use alpha equal to 0.1 you will never exactly get there, because each step only multiplies by a small factor; there is no way to get exactly (0, 0) out of this. So you only converge slowly; theoretically it would take infinite iterations to reach (0, 0), though numerically we know that below machine epsilon it will stop anyway.

So to find the absolute minimum, where grad J goes exactly to 0, you might need infinite iterations, which is why we need a stopping criterion: we need to say something like, okay, I am happy with two decimal places of accuracy. We will see how to do that in the next video.

(Refer Slide Time: 19:12)

In the meantime, let us look at another alpha, alpha equal to 0.5; by now you should be familiar with the whole process. If we put alpha equal to 0.5, you actually get (0, 0) right at the first step. You start here, and we come here. Now what happens to the algorithm once it comes to the right minimum? If you come to (0, 0), note that grad J is also (0, 0), because this is the actual minimum; J is 4 of course, and w^{k+1} is (0, 0), because it is (0, 0) minus alpha times (0, 0), so it just stays there.

So in all future iterations it will always stay at (0, 0). This is an advantage of gradient descent: because you have w equal to w minus alpha grad J, the moment grad J goes to 0, w will stop there. Of course, this can also happen at a false minimum; it can get stuck at something like a saddle point, but we will see cases of that sort in the coming weeks. The important thing here is that alpha equal to 0.5 converges quite rapidly.

(Refer Slide Time: 20:30)

So what we have seen so far is that the gradient descent algorithm can either diverge, which we saw for alpha equal to 2; oscillate without diverging or converging, which we saw for alpha equal to 1; converge slowly, which we saw with alpha equal to 0.1; or converge quite rapidly, which happened with alpha equal to 0.5, okay. In practical algorithms you will probably never see a case such as alpha equal to 0.5 where in one step you get to the right answer; that will almost never happen.

But typically you are going to see some manifestation of either slow convergence or fast convergence. All of this depends on the learning rate alpha; part of algorithm design, what you will have to do as a user, is to choose the right alpha. There are methods with some variations on this, which we will discuss in the coming weeks, but alpha is what is called a hyperparameter, okay.

A hyperparameter is a parameter that must be set before your learning algorithm actually starts, okay; even before you actually learn, you will have to set some parameters, and alpha is just one such example. In fact, an open problem in neural network and deep learning research is what is called the design of hyperparameters: coming up with optimal hyperparameters.

(Refer Slide Time: 21:54)

In the next video we will see some of the details of gradient descent. For example, we will see a proof of the steepest descent property, the fact that the gradient represents the direction of steepest descent. We will also look at the point I mentioned briefly for the alpha equal to 0.1 case: you need to decide when to stop the algorithm, since you will never actually get to the full minimum, okay.

The third thing we have to work out is how to calculate gradients when there is no analytical expression for J available. So these are the three issues that we will be discussing in the next video, thank you.

Machine Learning for Engineering and Science Applications
Professor Dr. Balaji Srinivasan
Department of Mechanical Engineering
Indian Institute of Technology, Madras
Gradient Descent – 2: Proof of Steepest Descent, Numerical Gradient Calculation, Stopping Criteria

(Refer Slide Time: 0:15)

In this video we will be looking at some of the details of gradient descent which we skipped in the last video. First we will look at a proof of the steepest descent property; we will also look at how to calculate numerical gradients in case an analytical expression for J is not available; and finally we will look at when to stop your iteration, or at least some criteria which people commonly use for this.

(Refer Slide Time: 0:40)

Okay, so once again let us look at the kind of functions we were looking at before, with this being the landscape and this being the contour of the function; remember this is J, and let us arbitrarily name the axes w_1, w_2, okay. If you are at some point, let us say here, some (w_1, w_2), you have some value of J. Remember that these lines represent values of J; similarly, if you look at it in 3D, there is some value of J here.

Now you have several choices from here; you could move in any direction. Whichever direction you move in, if you are at some w and you move to w plus a delta w vector, then instead of J you will have some J plus delta J. So the question is: which direction gives you the maximum change, that is, in which direction will the rate of change of J be the maximum?

What I have plotted here are vectors equal to the gradient of J; you can see that as they come near the origin they become smaller and smaller, and at every point grad J points in a certain direction: it is a vector, with a magnitude and a direction. The claim of the steepest descent theorem, in some sense, is that the direction of maximum rate of change for a function J of w is given by grad J; if you take a unit vector in that direction and move along it, that is the direction in which J will change the most.

Just to give you an example: if you are in a room with a heater at one end and an air conditioner at the other, then along the straight line joining the heater to the air conditioner is where the temperature will change most rapidly. In all other directions the temperature will still change, but at a different speed. The direction in which this change is maximum turns out to be the gradient; let us prove that very shortly.

Remember that if you want the rate of change of a scalar J (we are always dealing with scalar cost functions in this course) in a given direction, that is given by the directional derivative ∂J/∂v; assume for now that the vector v is a unit vector. Also remember we saw that ∂J/∂v can be written as grad J dotted with v̂; this is what gives you ∂J/∂v.

Now since this is a dot product, you can write it as the norm of G, where G is nothing but the grad J vector, multiplied by the norm (or length) of v, multiplied by cos θ, where θ is the angle between G and v; this is the simple definition of the dot product. So G is grad J and θ is the angle between the gradient and v. Since this is the rate of change of J in any direction, we can ask for its maximum: remember norm G is fixed, norm v is also fixed, and what is variable is the angle between the two.

So when is this a maximum? It is maximum when θ is 0, and when is it a minimum? When θ is π. This is pretty straightforward. When is θ equal to 0? θ = 0 means v and G are parallel; in other words, when you choose a direction along the gradient you will increase most rapidly, and when you choose a direction directly opposite to the gradient, exactly 180 degrees away, you will decrease most rapidly.

You can see this in the figure also: the gradient (this is another property we are not proving here) is normal to the contours. So if I am here and I want to increase most rapidly towards the next contour value, the best direction to go in is directly perpendicular; if I go this other way I will have to travel further to get the same change. That is what this property represents: the rate of change is maximum right along the gradient, and minimum, or maximally negative, in the opposite direction. So this is the proof of the theorem.
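In compact form, the whole argument is the following chain of equalities, writing θ for the angle between grad J and the unit direction v̂:

\frac{\partial J}{\partial \hat{v}} = \nabla J \cdot \hat{v} = \|\nabla J\|\,\|\hat{v}\|\cos\theta = \|\nabla J\|\cos\theta

which is largest at θ = 0 (move along the gradient for steepest ascent) and smallest at θ = π (move along the negative gradient for steepest descent).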

(Refer Slide Time: 5:44)

Next we want to find out how to calculate gradients numerically. Remember, as I said earlier, we do not have explicit expressions for the gradient available in most cases. This can happen for one of two reasons. One: there is genuinely no analytical expression at all for J; it is available only as a black box, so grad J is not available. Another case is when J is available, but as a composition of functions.

An example of that is: w runs through a box f_1, then through another box f_2, then f_3, f_4, and so on up to f_n, and even though you have analytical expressions for each one of these, the derivative of J is very hard to write out because it is a very lengthy expression. In both cases, a very simple solution you can use is something called the finite difference method. What is the finite difference method? It is simply our definition of the derivative written down a little more explicitly.

Suppose you have J as a function of w_1 and w_2 and you want ∂J/∂w_1; you would say ∂J/∂w_1 is approximately [J(w_1 + Δw_1, w_2) − J(w_1, w_2)] / Δw_1. In the case of a w which has n components, or what are called n features, you simply perturb the particular variable whose derivative you are interested in: if you are interested in the j-th derivative you perturb the j-th variable, so ∂J/∂w_j is approximately [J(w_1, ..., w_j + Δw_j, ..., w_n) − J(w_1, ..., w_j, ..., w_n)] / Δw_j. You have to perturb each of the variables in turn.

So you can use finite differences in both cases. However, the number of features is simply n, the size of your w, and if the number of features is very high this can become very expensive, because you will have to perturb each one. For example, as we were discussing earlier, if you have a 60 by 60 image you have 3600 features, and you will have to perturb 3600 times in order to find each of these derivatives; this can become very large.
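Here is a minimal sketch of this perturbation loop in Python; the step size delta and the test function are illustrative assumptions.

import numpy as np

def numerical_grad(J, w, delta=1e-6):
    # forward finite differences: one extra evaluation of J per feature,
    # which is exactly why this becomes expensive when n is large
    g = np.zeros_like(w)
    J0 = J(w)
    for j in range(len(w)):
        w_pert = w.copy()
        w_pert[j] += delta            # perturb only the j-th feature
        g[j] = (J(w_pert) - J0) / delta
    return g

J = lambda w: w[0]**2 + w[1]**2 + 4.0
print(numerical_grad(J, np.array([3.0, 4.0])))   # approximately [6, 8]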

A different method, which is what has made neural networks practical of late, is what is called automatic differentiation. This is very useful in case you have an analytical expression, but the expression is hidden as a chain. So suppose you have analytical expressions for f_1, f_2, and so on; instead of f_n we can call the last one f_k, so that you do not get confused with the n features.

So let us say you have k functions in such a chain, with k being 100, which is very common in something like deep learning. Say this expression is linear, this one is quadratic, this one is linear again: it is impossible, at least by hand, to write the whole thing down and differentiate it, which is why the computer does the analytical differentiation; this is called automatic differentiation.

You use either finite differences or automatic differentiation, and automatic differentiation is the method of choice when you have an analytical expression but it is the result of a chain of compositions. We will see this within the context of neural networks, where it is called backpropagation, and that is what has made modern neural networks possible ever since the 70s and 80s when the algorithm was first formulated.

So just as a summary: in case you really have no other choice and J is only available numerically, you use finite differences; but in case J is available as a chain of functions, for example f(g(h(w))) or something of that sort, then we use automatic differentiation.
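For comparison with the finite difference sketch above, here is the same gradient obtained by automatic differentiation, using PyTorch's autograd as an assumed example framework (PyTorch itself is introduced later in this course):

import torch

w = torch.tensor([3.0, 4.0], requires_grad=True)
J = (w ** 2).sum() + 4.0    # J built up as a chain of elementary operations
J.backward()                # backpropagation through that chain
print(w.grad)               # tensor([6., 8.]), the exact gradient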

(Refer Slide Time: 10:14)

The final topic for this video is what are called stopping criteria. We are trying to find the optimal w, and ideally we should stop our iteration when grad J actually becomes 0; remember, 0 here means the zero vector, not just a zero value. This almost never happens in practice because, as we saw with the alpha equal to 0.1 case, the number of iterations could actually be infinite; also, you cannot demand 0 to infinite precision, since your machine itself has finite precision. For these two reasons you actually want to stop a little earlier.

Once again let us look at our previous example: we started somewhere and wanted to go to the origin. Instead of demanding infinite precision, we choose some finite precision. Say I am happy with five decimal places; in that case I set some tolerance, for which the standard notation is epsilon, to say that I am happy with 10 to the power minus 5, or five decimal places of accuracy. But five decimal places of accuracy in what?

We have multiple options. For one, our friend the norm returns here. Remember w is a vector. If you recall the iterations we did, suppose at some iteration the w vector is (1.012, 1.011), say in the third iteration, and in the fourth iteration it is (1.009, 1.010); then up to two decimal places we can be fairly certain that we have got the answer reasonably correct, especially if this repeats again and again.

Now remember, of course, these are vectors; how do we find out how close one vector is to another? By subtracting them and taking the norm. So you subtract them and take the norm, typically any norm, the 1-norm, 2-norm, or infinity norm, that is your choice; it will come out to be a number, and I say that the difference between the previous vector and the next vector should be smaller than the given precision, which in the example I have given is 10 to the power minus 5.

Instead of saying successive w's have to be close to each other, I could also say something like: the norm of grad J should be smaller than a given number. Remember we were trying to make grad J equal to 0, and instead of demanding exactly 0, I could say that the moment the norm of grad J becomes lower than 10 to the power minus 5, I will stop. Some iteration of this sort could happen: you start with some w and end up at w's that are closer and closer to each other.

Finally (so closeness of w's is choice 1, and smallness of grad J is choice 2), I could also look at the cost function itself. Remember in our example J was going from 29 to 229, etc., when it was diverging, but was slowly coming down to 4 when it was converging. So I could look at the cost of the previous iteration and the cost of the next iteration and find the difference between the two.

Here is a standard figure that is often drawn: this axis is J, this is the number of iterations, and people often visually track the convergence. You cannot easily track the convergence of w, because w is a whole (often very large) vector, but you can track the convergence of the cost function. Here it started at some value, in our case 29, and slowly started decreasing; you can satisfy yourself that after some time it converges to some value, and you could stop here. So we can look at the difference in successive J's; this is the third choice that we have.

(Refer Slide Time: 14:16)

So here is a quick summary of the gradient descent procedure. You first decide on the hyperparameter, the learning rate alpha; you also decide on epsilon, and on which stopping criterion you will use: based on w, based on grad J, or based on J itself. Then you make an initial guess for w and calculate the next guess; for this you will always have to calculate the gradient, either numerically or analytically, whichever way it is available to you. Then you evaluate the stopping criterion.

If the stopping criterion is satisfied, say the difference between J now and J before is less than the epsilon you decided on, then you stop; if not, repeat the iteration. So this is the summary of the gradient descent procedure, and this procedure will be used, with minor variants, throughout this course, especially throughout the deep learning module, thank you.
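Putting the whole procedure together, here is a sketch in Python; the choice of the cost-difference stopping criterion (choice 3 above) and the particular test function are assumptions for illustration.

import numpy as np

def gradient_descent(J, grad_J, w0, alpha=0.1, eps=1e-5, max_iter=10000):
    w = w0.copy()
    J_old = J(w)
    for k in range(max_iter):
        w = w - alpha * grad_J(w)        # gradient descent update
        J_new = J(w)
        if abs(J_new - J_old) < eps:     # stopping criterion on the cost
            break
        J_old = J_new
    return w

J = lambda w: np.sum(w**2) + 4.0
grad_J = lambda w: 2.0 * w
print(gradient_descent(J, grad_J, np.array([3.0, 4.0])))   # close to (0, 0)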

Machine Learning for Engineering and Science Applications
Professor Dr. Ganapathy Krishnamurthi
Department of Engineering Design
Indian Institute of Technology, Madras
Introduction to Packages

(Refer Slide Time: 0:14)

Hello and welcome back. We will give you a small introduction to the various packages available for implementing some of the machine learning and deep learning algorithms that you will learn in this course.

(Refer Slide Time: 0:26)

To start off with, of course, all of you must be familiar with Python. It is an interpreter-based programming language that is very popular, especially for prototyping and in many cases even for production-level software. It has several built-in data types, including numbers, strings, lists, and dictionaries. It has the typical programming constructs like for loops, conditionals, and functions, and function arguments are in general passed by object reference.

It has scientific computing libraries like NumPy and SciPy, and also plotting capabilities through Matplotlib; these are modules that you can import into Python as you program, okay. There are various resources available on the web; one of them is mentioned at the bottom of this slide and you are welcome to explore these options. There are also online courses available which let you learn Python from scratch, so that you are more comfortable programming in this language.

This is important for many of the other packages that we will see in the following slides. So it is good to have basic capability in Python and get started before we reach the point where we do some programming.

(Refer Slide Time: 1:45)

Scikit-learn is another module that can be installed in Python; it requires both Python and the modules NumPy and SciPy. Scikit-learn has a lot of machine learning and computer vision algorithms already implemented and available as part of it, which we can just call like a function.

Once again, scikit-learn comes with some excellent manuals, documentation, and example code which you should try out, okay. It is Python based, so if you know programming in Python you should be able to use scikit-learn without any difficulty.
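As a taste of the call-it-like-a-function style, here is a minimal sketch that fits a classifier on one of scikit-learn's bundled toy datasets; the choice of dataset and model is illustrative only.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)              # a small built-in dataset
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict(X[:5]))                      # predicted class labels
print(clf.score(X, y))                         # accuracy on the training data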

(Refer Slide Time: 2:22)

Okay, now we move on to some of the deep learning frameworks. Why do we need them? Because they help you easily implement and prototype deep learning algorithms, okay. In general, coding these algorithms from scratch, even though it is a good exercise, can distract you from your primary purpose; we expect that most of you are working in some engineering domain where you want to solve a particular problem, and not necessarily solve a programming problem, in this case implementing deep learning from scratch.

There are a multitude of solutions available; I have listed some of the more popular ones here: PyTorch, TensorFlow, Keras (which sits on top of TensorFlow as its API), cuDNN offered by NVIDIA, Caffe, and MXNet. Some of these are offered by large companies as free, open-source software. You are welcome to adopt any of them for your assignments, in order to learn the concepts that we will present in this course.

(Refer Slide Time: 3:30)

If you look at TensorFlow, it is an open-source library offered by Google, used for setting up data flow graphs. Any neural network, including the deep neural networks you will see in the following weeks, can be set up in the form of a graph. TensorFlow works in the following fashion: you assemble a graph, basically defining the computational nodes in the graph, and then you invoke a session to execute the computations in those nodes, okay.
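A minimal sketch of this assemble-then-execute workflow, in the TensorFlow 1.x style that this description refers to (note that TensorFlow 2.x executes eagerly by default instead of using sessions):

import tensorflow as tf

a = tf.placeholder(tf.float32)      # graph inputs, no values yet
b = tf.placeholder(tf.float32)
c = a * b + 4.0                     # a computational node in the graph

with tf.Session() as sess:          # invoke a session...
    print(sess.run(c, feed_dict={a: 3.0, b: 4.0}))   # ...to execute: 16.0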

It is a very nifty tool for implementing deep neural networks. It is also very convenient because, for a lot of the more popular deep learning algorithms and architectures that work very well, many of the implementations are available through TensorFlow; they are free to download and you can learn to code just by looking at those examples.

(Refer Slide Time: 4:27)

Keras is the official high-level API for TensorFlow. It greatly reduces the programming complexity involved: even though TensorFlow already makes coding easier, especially for implementing deep learning algorithms, Keras provides one more layer on top, which gives you a very simple way of implementing some of the more popular or more conventional architectures that we see in deep learning.

You will see support for sequential models, for CNN-like (convolutional neural network) models, and for regular neural network models, okay. It also provides a functional API where you can call functions at different points in the code and move them around like Lego bricks.

(Refer Slide Time: 5:18)

For instance, here is a very simple script for implementing a typical sequential neural network model; you can see that it is accomplished in a very short piece of code. Some of the terminology may not be very familiar to you yet, but what is going on here is that the computational aspects of the neural network are defined very easily using this model, okay.

So with Keras even a very complicated deep learning model can be implemented in very few lines of code, and it is very transparent. That is one of the reasons why you would want some of these packages, right.
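The slide's script itself is not reproduced in this transcript; a representative sketch of the kind of short Keras sequential model being described might look like this (layer sizes and loss are illustrative assumptions):

from tensorflow import keras

model = keras.Sequential([
    keras.layers.Dense(32, activation='relu', input_shape=(10,)),
    keras.layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='sgd', loss='binary_crossentropy',
              metrics=['accuracy'])
model.summary()   # a complete, trainable model in a handful of lines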

(Refer Slide Time: 6:00)

We move on to the next one, which is PyTorch. PyTorch and TensorFlow are both very popular among the deep learning community, and both offer a lot of features that are very good for prototyping. PyTorch also offers a lot of the more popular deep learning architectures already coded, with the weights, etc.; we will see what this means later on, okay.

PyTorch also combines seamlessly with Python, so if you learn Python, PyTorch becomes very easy; that is another advantage. In terms of capabilities, PyTorch and TensorFlow are quite similar for coding the deep learning algorithms that you will see in this course. They define variables slightly differently, but other than that, PyTorch is as good a platform to start with as TensorFlow. PyTorch is provided by Facebook as one of their open-source implementations of deep learning algorithms.

(Refer Slide Time: 7:04)

If instead you like to work with C++, if you like to code that way, then Caffe is the package for you. It is again an open-source package, from Berkeley I think, and it is written in C++. Initially it supported mostly convolutional neural network architectures, but now I think it is branching out into other deep learning architectures as well. Once again it is open source with a lot of support in the online community: if you want to code something up and wonder how to do it, there will be a support community there for Caffe. So if you are someone who likes coding in C++, then probably this is the package for you.

(Refer Slide Time: 7:44)

Google Colab, again, is not exactly a package like the ones we talked about, but I would like you to check it out. It provides a free cloud service, and in fact gives you access to free GPUs and what they call TPUs, tensor processing units. It also supports PyTorch, TensorFlow, Keras, and other open-source software packages that you can use for implementing deep learning models.

Once again, the attractive aspect of this is the availability of free computing power: from your own laptop you can actually run slightly sophisticated code. So if you are programming in Python and want to try something a little more adventurous, let us say using PyTorch or TensorFlow, then this might be a good option for you. Again, Colab is just the cloud computing infrastructure and not exactly a package.

(Refer Slide Time: 8:30)

Finally we come to MATLAB, which is what we will use in this course, primarily because MATLAB is providing the students who are enrolled in this course with a free account to log into. This is MATLAB Online: it gives you an interface very similar to the MATLAB desktop, and you can try many of the deep learning and machine learning algorithms that we cover in this course, okay.

Another reason for choosing MATLAB is that we expect the students signing up for this course to come from a variety of engineering disciplines, and I am sure most of you are familiar with MATLAB, which is the programming platform of choice for engineering students in general, both in research and in the classroom. We will be using some of the toolboxes in the more recent version of MATLAB, R2018b.

There is a computer vision toolbox, a statistics and machine learning toolbox, as well as a deep learning toolbox; I think the deep learning toolbox was formerly referred to as the neural network toolbox, so it has been around for some time, but now there is renewed interest, okay. If you are interested in machine vision or computer vision problems, there are some handy algorithms already available that we can use; the non-deep-learning machine learning algorithms are also available for you to try and test out, and they also provide some datasets which you can load in and play with, okay.

In addition, the deep learning toolbox offers a very easy-to-use interface for developing convolutional neural network architectures, LSTMs, and the like.

(Refer Slide Time: 10:14)

What appeals to me personally is that they have a very nice module for accessing medical images: many medical images are generated and stored in something called the DICOM format, and MATLAB gives you the software routines to read those images into memory, okay. Once again, a lot of the post-processing that you will do on images, as we will see later in the course, is also already available, coded in the form of easy-to-call functions in MATLAB, especially for medical image analysis.

Again, for someone who works in image processing, it has a very nice image registration toolbox, with set-up routines for doing image registration, which is an integral part of medical image processing.

(Refer Slide Time: 11:02)

Once again, there are other resources for conventional machine learning algorithms, okay; they are listed here and I will not go through all of them, but what I am trying to say is that many of them have a command-line interface, so you can run them from the MATLAB command line. For this course you will be given access to an online MATLAB license, and you can try these commands in the browser as well. That is a very convenient thing that MATLAB allows you to do.

In addition, they also have nftool, which provides a GUI for creating neural networks.

(Refer Slide Time: 11:42)

So let us look at some of them. For instance, if you load up nftool you will get an interface like this, where you can define your neural network. You can say where the data is; in this case this is the dialog that lets you load data which you have already stored in some format, okay. Once you have loaded the data set, you can divide it into training, testing, and validation: it automatically tells you how many samples there are and asks how many you want for training, testing, and validation.

(Refer Slide Time: 12:12)

And then of course you can also use the same dialog box to define neural network architectures: how many hidden layers, etc. Again, if these terminologies are not familiar to you at the start, you just have to wait another week or so and you will be fine; we are just walking you through some of the easier aspects of using MATLAB here. It is also very simple to choose the training algorithm; you are familiar with one by now, gradient descent, which is the training algorithm you have just learnt. Here it lets you choose some other optimization techniques as the training algorithm.

(Refer Slide Time: 12:44)

On top of that, you can see the performance during training in the form of graphs: it tells you what the training, testing, and validation accuracies are, and you can define various hyperparameters, monitor them, etc., using the same GUI.

(Refer Slide Time: 12:58)

What we saw so far was for the conventional neural network; for deep neural networks, it again lets you access the various layers in a network in the form of modules that you can drag and drop to create your own. Once you have created that, it generates code automatically that you can go and edit; that is the idea behind using this GUI for creating deep neural networks.

(Refer Slide Time: 13:19)

Again, I will not go through all the details, but what this GUI shows is that you can define different layers here and examine what each of these layers is in the toolbox on the right-hand side. It is drag-and-drop functionality, okay.

(Refer Slide Time: 13:38)

Transfer learning is a bit too far ahead at this point, so I will not really go into detail on this particular slide, but transfer learning is supported in the sense that a lot of the large, popular convolutional networks and deep neural networks are already available as part of MATLAB, and you can reuse them for your limited data set. In this course we expect you to use MATLAB primarily because MATLAB is supporting the course: it provides all registrants access to an online MATLAB license, which gives you an interface very similar to the desktop interface.

It also provides 250 megabytes of storage, so you can upload your own data: of course not very large data sets, but reasonably large for most of the tasks that you will be doing. They also provide support, in the sense that if you have trouble accessing it, etc., they will help with that too. So it takes away a lot of the trouble you might face if you had to figure out an entire platform for yourself. Having said that, we are platform agnostic to a fair degree; in fact we personally have used many different software platforms.

So we are okay if you want to pursue any of the other packages that we have mentioned, but please note that it is difficult for us to provide support in any form if you have trouble with the platform itself, okay. MATLAB, on the other hand, can provide you limited support if you have trouble with that platform, so that is a possibility. I hope that we have given you a broad overview of the different options available for you to learn to code; we leave it to you to choose the option that best suits you.

Please remember that there is nothing wrong with starting with MATLAB: it is easy to start that way, it provides you a lot, and especially if you are already familiar with MATLAB it is fine to start and learn there, because many of the programming concepts are similar across the board, so it is not too hard to switch to other platforms in the future if the need arises, thank you.

Machine Learning for Engineering and Science Applications
Professor Dr. Balaji Srinivasan
Department of Mechanical Engineering
Indian Institute of Technology, Madras
The Learning Paradigm

(Refer Slide Time: 00:15)

Welcome to week 4 of machine learning for Engineering and Science Applications. For the first
3 weeks we were looking at some mathematical and computational preliminaries. This week we
will actually start our excursion into machine learning. So before we do that I will be introducing
you to the basic learning paradigm that will be used especially for the deep learning module. You
might recall that we had split the course into essentially four large sections, the first 3 weeks
were the mathematical preliminaries, the next set which will be continuing until approximately
week 8 would be deep learning, which is the most popular machine learning algorithm or the
machine learning family in use right now. Of course we never know which algorithm is going to make a strong comeback. So we are going to concentrate on other machine learning approaches after we finish deep learning, and finally we will look at some advanced algorithms in the last
couple of weeks. Now the paradigm that I am describing right now in this video is particularly
applicable to the deep learning set. So you will see that the various models that we look at or the
various algorithms we look at will have a sort of standard template. So what I am describing right now is the general template, so that you see the pattern when we repeat it. Often it is easy to lose yourself in an algorithm without seeing the bigger picture. So what I am describing right now, before we get into deep learning, is the overall paradigm or overall template that we will be using here.

(Refer Slide Time: 01:51)

Before we go to the deep learning or the learning paradigm, I would like to talk about some minor changes in the syllabus relative to what was announced at the beginning of the course. Previously we had planned to introduce applications at the end of each week: we would finish a week of lectures and then talk about some application as we stepped into deep learning. However, this had several implementation problems, especially due to the time it usually takes to complete a week's lectures, which by itself is around two and a half to three hours; if we started discussing applications it would have taken much longer. So what we have done now is remove some really advanced portions which we had kept as week 11 of this course, namely Structured Probabilistic Models and Monte Carlo methods; you can anyway do these as part of other courses. Instead we will now devote one full week, around week 9 or 10 of this course, after we have finished the deep learning portions, to discussing all the engineering applications that have come up, especially over the last few years. There have been some old applications, but we will especially be discussing new applications; you will find some surprising instances of this in weeks 9 and 10, and we hope that you will enjoy that. We will have a concentrated discussion of applications. The other weeks' lectures will of course cover the primary theory and the primary computational ideas that you need in order to implement deep learning, and we will take some standard examples and some engineering examples, but the heavy load of applications will be relegated to week 9 or 10 of this course.

(Refer Slide Time: 03:31)

In this video we will be discussing the following ideas. Most of deep learning uses one simple learning paradigm, one simple template, by which it learns.

(Refer Slide Time: 03:45)

We will then look at how it applies to linear regression and logistic regression. I will give you a brief idea and then we will get into the details. Recall, we looked at this in week 1 itself: the standard machine learning paradigm is different from what happens in classical programming. In classical programming you give some data and you give the rules for manipulating that data, and that gives you the final answer that you require.

On the other hand, in machine learning, we give the input data, we give the final answers that are required, and the machine is supposed to figure out what the rules are. This kind of rule is what I call a model. Model means, basically, how the input is related to the output. If I speak and you hear words, even though all I am doing is making sound waves in air, how is it that you translate the sound waves into words? That is a model. In data terms, a model basically takes a set of numbers and turns it into another set of numbers through a map.

(Refer Slide Time: 05:06)

So now let's look at this in some more detail, specifically at the machine learning paradigm. What you have is input data, you also give the output data, and what you figure out are what are called parameters or weights. Please notice this word; it will keep recurring until we get to week 9, because classical deep learning is all about learning the parameters or the weights. Our task is to learn the relationship between these two; this is what we don't know. So if we see a bunch of pixels and identify them as a human being, what is it that is happening there? That is what we want to learn.

In engineering or science terms, we want to know the model. For example, what is the relation between today's temperature and yesterday's temperature? You can treat today's temperature as y and yesterday's temperature as x, and you want to know the relationship between the two: that would be a model. So we are using model as a big umbrella term for something which relates one thing to another. For now we will think of this relationship as a function; this is the standard model we will be using till we come to the end of deep learning. We call this function, which relates x to y, the
model or the hypothesis function. Remember in science, we postulate or we have a hypothesis,
this is probably what is happening. So you have a hypothesis, something like: the gravitational attraction between the earth and the sun is GmM/R^2. That is a hypothesis, a mathematical function. Now the important point is that every function has two parts. So let us take one function. So
let us say x is a vector with two components x_1 and x_2, and I have my function y = h(x), given as h(x) = w_0 + w_1 x_1 + w_2 x_2 + w_3 x_1 x_2.

Now this function itself has two parts. One is the form of the function and the other is the parameters of the function. The form is the fact that the first term is a constant (1), the second term is linear (x_1), the third term is linear (x_2), and the fourth term is non-linear (x_1 x_2); this is called the form. The parameters are the w_0, w_1, w_2, and w_3. So when you have some x and some y, you can have an infinite number of forms: linear, quadratic, cubic, exponential, whatever it is. And in each one of these you have some unknowns, called the weights or the parameters, which are sort of the knobs that you turn for the function to look different each time.

If you have a linear function w_0 + w_1 x, you have two weights. If you have a function w_0 + w_1 x_1 + w_2 x_2 + w_3 x_1 x_2, you have four weights. So when we say that machine learning learns, what it gives out is actually the parameters or the weights. Within the deep learning module this is the only thing we will be doing. We as users already give the form. Suppose I say x_1, x_2, and x_1 x_2 make up the form of the function; machine learning is not smart enough to try x_1 x_2^2. It cannot try any such thing; it can only try within the limits that you have given it. So the user defines the form.

The user, the programmer or the machine learning engineer, gives this form, and the parameters are found by the algorithm. So when we look at the learning algorithm, all it learns is just the parameters. Now how do you decide on the form? That usually requires domain knowledge; that is, you need to know what the relationship looks like. If you have the relationship between voltage and current, you kind of know from intuition or from your physics knowledge that it is supposed to be linear. Somebody working in solid mechanics with stress and strain knows it is kind of bilinear, or maybe a small variation on that.

So the form is usually decided by what you already know from science or engineering about the function, and the parameters or weights are given by the machine. In some sense, sophisticated curve fitting is what is going on in much of machine learning. Now this modeling process, which is going to figure out these parameters or weights, involves two separate processes: one is called feed forward and one is called feedback.

(Refer Slide Time: 10:41)

Let us look at feed forward, or forward modeling. Say you have a process of this sort (ref slide time: 10:45): you give x, you also give the weights, and what comes out is y = f(x; w). A model or hypothesis, as we discussed in the previous slide, is simply an educated guess at what the relationship between the input and the output is. In some cases it is kind of obvious and in some cases it is not. You usually have two pieces, as I mentioned just now: the form, which is linear, quadratic, exponential, etc., and the parameters, which are unknown or adjustable constants sitting there.

We sometimes use the notation y = f(x; w): whatever comes after the semicolon in this notation is a parameter. Notice that whenever I give you x and some choice of w, you can find y if you are given f. If you are given the form of the function, x, and w, then y is fully determined. This process of going from x and w and coming out with y is called the forward model; it is sometimes also called feed forward. Feed forward is nothing but putting x and w into this box and getting out y.
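A minimal sketch of such a forward model in Python, using the form assumed earlier in this lecture (the numbers are illustrative only):

def forward(x, w):
    # assumed form: h(x; w) = w0 + w1*x1 + w2*x2 + w3*x1*x2
    x1, x2 = x
    return w[0] + w[1]*x1 + w[2]*x2 + w[3]*x1*x2

# given x and a choice of w, y is fully determined
print(forward((2.0, 3.0), [1.0, 0.5, 0.5, 0.1]))   # 4.1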

(Refer Slide Time: 12:35)

However, as I said earlier, what we are interested in knowing is w. This is the way it works. Let us say you collect lots and lots of data, x and the corresponding y. Suppose we are doing a voltage and current experiment: you do one experiment with a particular resistor and find voltage and current (V_1, I_1), do a second experiment and find (V_2, I_2), and so on and so forth, collecting lots of data. We will call the input, as usual, x and the output y.

I am also going to use the term ground truth, which is what came out of the experiment. A standard machine learning example is finding house prices in a particular area. You might guess that houses in that area cost, let us say, 1500 rupees per square foot, but you will actually have to go to a house, measure the area, and check the price. The price might be different from what you guessed; the owner might say, no, I am going to charge you 2000 rupees per square foot. Ground truth is what actually exists out there in reality. Similarly, suppose a radiologist, or somebody else, is trying to identify a cancerous tumor: you go to the radiologist and show the images, and the person will look at the scan and say which patient has cancer. That is the ground truth.

Now you have your model, which is separate from the ground truth; it is going to be a mathematical process. We are going to use a hypothesis function h. That is, you do not know how the radiologist came up with cancer or no cancer; you are going to guess a function, and for that the user has to give a form. For example, we say that h(x) = w_0 + w_1 x_1 + w_2 x_2 is a guessed model. Now we do an iterative method, just like we did in gradient descent: we initially guess some arbitrary value of w. Once I know x, w, and the form of the function, you will
get some h(x; w), which we will call ŷ in order to distinguish it from y: y is the experimental truth and ŷ is our guess. So x is fixed, w is guessed, y is fixed because that is the truth, and ŷ depends on h. Our guess gives us some ŷ, and the ground truth is something else; you will usually find some difference between the ground truth and our hypothesis. So what we cleverly do is define a cost function. What the cost function does is tell you, when the ground truth was 1 and you said 0.8, how much it costs you to have this difference between 1 and 0.8. You can have various different cost functions. Once y and ŷ are given, you can find the cost J, which is in some sense a measure of the difference between y and ŷ.

In general, what is supposed to happen is: if y is equal to ŷ, J should be zero. So depending on the difference between y and ŷ we define a J. Then what we do is find the w which minimizes that cost. This is the general idea, and a very important one: the learning problem of finding w is now reduced to an optimization problem. So we use some procedure such as gradient descent: for example, once we know J, we can keep iterating and find the optimal w which minimizes J. We will use this idea again and again. Just to summarize: h is user defined, x and y are data; you guess a w, run it through h, and it gives you a ŷ. You find that y and ŷ are different, which gives you a cost. Then, using gradient descent or some such method, you improve your w. This process of going from J to an improved w is called feedback. It is very, very similar to gradient descent; I have written gradient descent there because, as far as this course is concerned, we will only be looking at gradient descent and its variants (we will consider some variants next week).
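Here is a sketch of this whole loop on a toy linear regression problem: feed forward to get ŷ, a least-squares cost, and a gradient descent feedback step to improve w. The data is synthetic, generated to follow y = 2 x_1 + x_2 exactly, purely for illustration.

import numpy as np

X = np.array([[1.0, 1.0], [2.0, 1.0], [3.0, 2.0], [4.0, 3.0]])
y = np.array([3.0, 5.0, 8.0, 11.0])        # ground truth outputs
Xb = np.hstack([np.ones((len(X), 1)), X])  # prepend 1 so w[0] acts as w0

w = np.zeros(3)                            # initial guess for the weights
alpha = 0.05                               # learning rate
for k in range(2000):
    y_hat = Xb @ w                         # feed forward: y_hat = h(x; w)
    grad = -2.0 * Xb.T @ (y - y_hat) / len(y)   # gradient of mean (y - y_hat)^2
    w = w - alpha * grad                   # feedback: improve w
print(w)                                   # approaches (0, 2, 1)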

(Refer Slide Time: 18:53)

All this tells you what you have to provide as an engineer. First of all, you have to decide what an appropriate x and an appropriate y are for this problem. That is an important part, as I said much earlier, in week 1 itself: remember, any problem, even a seemingly qualitative problem, can be reduced to a data problem, and all solutions, as far as we are concerned, will be expressed as functions or maps. Second, you have to choose an appropriate and usually large data set. A lot of this is an interactive process: sometimes you might choose some nice x and y for the problem, some nice input and output vectors, but not have the data available, so you might have to do something a little cleverer in order to find an x and y for which you do have a large data set. Finding a good data set is a very, very important part of the process. Third, you need to decide on an appropriate form of the forward model. One example is the linear model I have shown, which is the first case we will be considering: if x is a vector [x_1, x_2, ..., x_n], then ŷ could be ŷ = h(x; w), a simple linear model. The fourth thing to decide is the form of the loss function; one simple loss function, called the least squares loss, is simply (y − ŷ)^2. And finally you have to decide how you are going to do the optimization; like I said, we will be sticking to gradient descent and its variants. An important thing to note is that you will have to give the hyperparameters, α etc. Machine learning is not magic; there is no magic going on here, and it requires a lot of input. For example, as we have discussed earlier, this might require domain knowledge. Neural nets try to take this requirement away, but even there, as we will see later in the applications, it is usually very, very useful to have some amount of domain knowledge.

(Refer Slide Time: 21:22)

So let us look ahead at what we will be doing in deep learning, specifically the various different types of hypothesis functions, now that we know that all that is required is to choose a hypothesis and learn its w. Each of these hypotheses has its own purpose and a specific domain where it will work; this is like any other model. You can think of what we will be discussing as various tools in your toolbox. Just as you would not use a hammer where a screwdriver is required: you can have a hammer, a screwdriver, a nut, a bolt, sometimes a spanner, all sitting in your toolbox. Each one of these is different, and it is your intuition and your knowledge of the domain that lets you use the appropriate tool in the appropriate place. So what we will very often do is tell you, "the use of a spanner is this, this is what it will do", and you have to find out where it applies. Similarly, we will be discussing various models, each of which is different: linear regression is used for simple curve fits, logistic regression is used for classification problems, and deep neural networks are used for almost any general non-linear problem, though sometimes they might be overkill.

So the reason why deep neural networks work, and probably why many people are in this particular course, is that we know that any function at all can actually be approximated by a deep neural network, or even by a reasonably shallow neural network; we will look at this next week. Then there are CNNs, which will be weeks 6 and 7 of this course; these are for vision based problems. Then you have recurrent neural networks for time series and sequential problems.

So all of these, like I said, are different tools in your toolbox, and we will go through their properties sequentially. There are also appropriate loss functions for each and every case. An important warning for you: it is possible that none of these is the best model for your problem. It is possible that you have some good idea of what the model looks like, and the overall paradigm that I am describing in this video will still work for you. So, as long as you have some idea about the problem, use as much knowledge as you have. We will come to some concluding remarks towards the end of the deep learning section on where to use it and where not to use it, but please do remember: use as much knowledge as you have about the fundamental problem that you are trying to solve; usually that is a good idea.

(Refer Slide Time: 24:14)

As we look at each of these models, we will specifically discuss the following aspects. First, what is the mathematical expression for $h(x; w)$? Linear regression, for example, is simply $h(x; w) = wx + b$ for a scalar input, and logistic regression is $h(x; w) = \sigma(wx + b)$, where $\sigma$ is called the sigmoid function, a nonlinear function whose details we will see later this week. You will see that logistic regression looks similar to linear regression, and this is the usual pattern we will follow: a linearity followed by a non-linearity, that is logistic regression.
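As a small illustration (with scalar w and b, and made-up values), these two hypothesis functions can be written as MATLAB anonymous functions:

sigma = @(z) 1./(1 + exp(-z));       % sigmoid: a smooth nonlinearity
h_lin = @(x, w, b) w*x + b;          % linear regression hypothesis
h_log = @(x, w, b) sigma(w*x + b);   % logistic regression: a linearity
                                     % followed by a nonlinearity
h_log(0, 1, 0)                       % returns 0.5, since sigma(0) = 0.5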

The next thing we will look at is: what is a good loss function? Remember, your y and $\hat{y}$ will usually be different, and you want to tell the machine that you have not got the answer you wanted; that is usually done through the loss function. So a non-zero loss function tells you that there is a difference between your hypothesis and the ground truth. The least squares loss is $(y - \hat{y})^2$, and binary cross entropy is $-y \ln \hat{y} - (1 - y) \ln(1 - \hat{y})$; the latter looks a little more complicated, but it is actually a fairly simple function.
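In MATLAB form, using the ground-truth-1, guess-0.8 example from earlier in this video (a sketch; the function handles and names are just for illustration):

L_sq  = @(y, yhat) (y - yhat).^2;                       % least squares loss
L_bce = @(y, yhat) -y.*log(yhat) - (1-y).*log(1-yhat);  % binary cross entropy
L_sq(1, 0.8)    % 0.04: the cost of saying 0.8 when the truth was 1
L_bce(1, 0.8)   % -log(0.8), about 0.223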
Then we will look at efficient ways of calculating the gradient, since gradient descent requires one, and finally at the optimal parameter search itself. For neural networks, the algorithm we use to compute the gradient is called the backpropagation algorithm. In the next video we will start our discussion with the first model for this course, which is linear regression, the linear model. Thank you.

Machine Learning for Engineering and Science Applications
Professor Dr. Balaji Srinivasan
Department of Mechanical Engineering
Indian Institute of Technology, Madras
A Linear Regression Example

In this video we will be looking at the first model that we have for this course, which is a linear model, and we will be doing a simple regression. Many of you would be familiar with this even from school days and college days, but we have a slightly different take: though it initially looks very similar, later on we will see how this fits into the whole machine learning idea. So please do not be careless with this, because once you understand this idea a lot of deep learning automatically becomes accessible; most of the issues that occur in deep learning occur even in this very simple case. It is actually an advantage if you already understand linear regression from before.

(Refer Slide Time: 00:56)

So remember that we were discussing two problems in supervised learning back in the first week: there is classification and there is regression. We will be looking at regression, which is simply the idea that you have lots of data points and want to predict y at a particular x which is not yet available. So if you have $x_{new}$ you want to predict $y_{new}$.

(Refer Slide Time: 01:26)

So let us take a simple example about the thermal expansion coefficient. As you know, metals expand as you increase the temperature, but the thermal expansion coefficient is actually not a constant; it itself varies with temperature. So here is some data for the thermal expansion coefficient of steel versus temperature, taken from a source where the temperature is given in Fahrenheit instead of Celsius. You can see a reasonable amount of variation, between 6.47 and 2.45, over a large variation in temperature. Remember this is temperature in Fahrenheit, which is why you get $-340$ etc.; in Kelvin you would not get that. So the question is: you have all this data in tabular form and you want the thermal expansion coefficient at some intermediate temperature, say at 70 degrees. There are several ways of doing it; a very simple way is linear interpolation, taking the value halfway between those at $60\,°F$ and $80\,°F$. But suppose you want to use all of this data and see if you can come up with a better model. This is usually what happens in data science: linear interpolation from one or two nearby data points might not always be a good idea. So we will see a simple demonstration in MATLAB of what a regression solution actually looks like for this problem.

(Refer Slide Time: 03:07)

What you are seeing here is my account on the MATLAB Online website. It is accessible to all of you, and you are welcome to use the same thing; we will also be sharing a copy of the code that I am showing right now, so that you can try it out for yourself. You are not constrained by this; you can also try the same thing anywhere else, for example in a Jupyter notebook. You will see that the format of what is called the MATLAB Live Editor is very similar to the Jupyter notebook, if some of you are familiar with Jupyter notebooks from using Python.

So I have this example of linear regression for the problem that we just saw in MATLAB; the same data that I showed you is here, taken from the source. (The source that I am showing here is an excellent source for both data as well as general numerical algorithms; I would highly recommend that people who are interested take a look at nm.mathforcollege.com.)

So here is the data; I have coded it up here in this MATLAB code. We will be looking at various different ways of doing this as we go on with the course. Once again, you are free to use any programming language that you would like; it is just that, as you can see, both for demonstration and for initial testing, MATLAB is extremely convenient. All I have done is code up this data right here, and I will now execute this code. Now my x and y coordinates are in place: T, the temperature, is x, and $\alpha$, the thermal expansion coefficient, is y. Now we can look at what the data looks like.

(Refer Slide Time: 05:03)

You will see now that MATLAB has plotted the data: x is the temperature, y is the coefficient, and there is a slight curve. Now, we do not know what sort of fit is good for this kind of data: is it a linear fit, a quadratic fit, or a cubic fit? It turns out that we can try this out systematically. I am going to use certain in-built MATLAB functions; there are several options for this, especially when we come to later parts of this week, and I will tell you what other in-built options exist in MATLAB. We will also be programming this from scratch, because this is a simple thing that we can program from scratch using gradient descent, and we will show examples of that too. But for now, just to get a physical picture of what is happening, I will use some in-built MATLAB functions for you to see what it looks like.

So what we are going to do is what is called polynomial regression. All that means is that I am going to find the best fit. If I fit a line to this data, all of us know that the line is not going to fit perfectly; it is impossible to fit all of these points with a single line. But besides a line I could also try a quadratic curve, or a cubic curve, for the same set of data. So we will try that step by step.

(Refer Slide Time: 06:29)

There is a function called polyfit within MATLAB. We are going to try the following three equations; remember, these are now our hypothesis equations, written in exactly the same form that I showed you in the previous video. The linear fit is $y = w_0 + w_1 x$, which you can think of as $y = w_0 x^0 + w_1 x^1$; in general, a polynomial fit of degree n is

$$y = \sum_{j=0}^{n} w_j x^j.$$

We will try a linear fit, a quadratic fit, and a cubic fit, and the coefficients are stored in the three variables coefficient1, coefficient2, and coefficient3. The argument to polyfit is simply the order of the polynomial; the first one is a first order polynomial. So let us run this for the whole data set and see how good a fit we get; in outline, the calls look like the sketch below.
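This is a sketch, assuming the tabulated temperatures and expansion coefficients are stored in the vectors T and alpha (the variable names are assumptions):

coefficient1 = polyfit(T, alpha, 1);   % best-fit line, w1*x + w0
coefficient2 = polyfit(T, alpha, 2);   % best-fit quadratic
coefficient3 = polyfit(T, alpha, 3);   % best-fit cubic
polyval(coefficient3, 70)              % cubic prediction at 70 degrees F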

(Refer Slide Time: 07:53)

And now we see that the linear fit looks like this; it is not unexpected, and in some sense this is what is called the best fit line. Remember, we were trying to find the optimal ws, and what polyfit has done is indeed find this optimal w. It is a fit; whether it is a great fit or not we will find out. But remember, if I am predicting at 70, I will predict my value somewhere here using this hypothesis function that I have just found.

We will now try the quadratic; the quadratic is the red curve, and you will see that it fits much better than the linear. Does it always happen that the quadratic fit is better than the linear? You will see later on that this is not necessarily true, but in this case it does happen that the quadratic fit is better, so it will give a much better prediction for the thermal expansion coefficient at 70 degrees. We can now try a cubic fit also. Here is the full data summarized: this line here is the linear fit, the red line is the cubic fit, and there is a black dotted line which is almost indistinguishable from the red line; you will see at the end that the cubic is actually slightly better.

So, for the data set that we were given, we find that the cubic fit looks much nicer than the quadratic fit, and the quadratic fit looks much nicer than the linear fit. Are looks always important? We will see that it is not necessarily true, in life or in machine learning, that it is a good idea to make something fit so well. In this case it is actually a good idea, but we will see how to formally calculate whether it is a good idea or not later on.

(Refer Slide Time: 09:58)

So we can also calculate our predictions. You see the different predictions here: the quadratic and the cubic fit predict about $6.4 \times 10^{-6}$, whereas the linear fit, as we saw, predicts much higher, about $6.7 \times 10^{-6}$. In the next video we will actually see how these fits work; we will also see in later videos how to use the linear model tools from MATLAB, but that is not relevant right now. We will learn how these fits actually work, what is actually going on behind the scenes when we do polyfit or any kind of linear model fit, in future videos.

Machine Learning For Engineering and Science Application
Professor Dr. Balaji Srinivasan
Department of Mechanical Engineering
Indian Institute of Technology Madras
Linear Regression Least Square Gradient Descent

In this video we will be looking at some details of linear regression. We saw a simple plot obtained through MATLAB for a linear fit, a quadratic fit, and a cubic fit in the last video; we will now look at some details of how to do this. Please pay attention to the process that is being shown here, because this is essentially the process that we will be repeating for almost all of deep learning. Especially for the deep learning module, as I said, the paradigm is set by what we do for a simple linear fit, and we will just be continuing that for the quadratic, the cubic, and then neural networks etc., and even for classification problems.

(Refer Slide Time: 00:52)

So here is the example that we saw last time. Last time we looked at temperature versus coefficient of thermal expansion; let us call the data on the x axis x and the data on the y axis y. This y, as I said earlier, is called the ground truth; it is basically the experimental truth or reality that is available to us. What we would like to know is what happens in between: that is the classic regression problem. We would like a fit for this data, and we saw three different kinds of fit last time. One of the fits was a linear fit: this line, which we can call $\hat{y}$, is a function of x, called the hypothesis function; in fact we hypothesized it to be $w_0 + w_1 x$. So this is $\hat{y}$ versus y: for the same x you have the real value and you also have the hypothesized prediction. We saw that there is a difference between the two, but nonetheless the overall trend is captured by the hypothesis. That was one of the things we saw last time. We also saw that the quadratic fit, shown in red here, is reasonably better than the linear, and the cubic is marginally better than the quadratic, almost indistinguishable from it. So we had all these different fits for the same set of data. Why do all these fits differ? Because our model $h(x)$ for what y is like, which we call $\hat{y}$, is actually different in each of these cases: linear, quadratic, and cubic. For example, for the quadratic we had $\hat{y} = w_0 + w_1 x + w_2 x^2$, and so on and so forth.

So what we will see in the next videos is how you actually come up with the coefficients. Last time I just used some in-built MATLAB function; now we are going to do it from scratch in the coming videos.

(Refer Slide Time: 03:09)

So let us look at this general problem. Just like the previous problem, you have some x and some y, and we can label the data points as data point 1, data point 2, and so on; let us say there are m such data points. We have a single input, like the temperature, and a single output, like the thermal expansion coefficient. So, say we take the data pair $(x^{(i)}, y^{(i)})$ and call it the i-th example. Why "example"? Because later on, for images, we will say "this is an example of a cat, this is an example of a dog"; each image is also called an example. This is simply machine learning terminology; you can call it a data point.

So "i-th example" simply means i-th data point. In this figure there are, I think, approximately 51 points, so you could start with $(x^{(1)}, y^{(1)})$ and go up to something like $(x^{(51)}, y^{(51)})$. We have all these points, and what we would like to see is which hypothesis function fits them best. Remember this terminology; we will be using a lot of letters, and within our course m simply means the number of data points, or number of examples, that you have. Please remember this.

(Refer Slide Time: 04:31)

Now let us look at how you do the fit. The input-output pairs are the given data, and $\hat{y}$ is our model over and above the given data: you have $x^{(1)}$ and $y^{(1)}$, but you are guessing what $\hat{y}$ should be for a given x. Remember, in the example we gave, x was the temperature and y was the thermal expansion coefficient; that is the actual value, and you will guess something else. We are going to introduce our first model hypothesis for this course, a very simple, trivial linear model, but it is enormously powerful, as you will shortly see in the next couple of videos.

Now there is only one question: we have already fixed the form of the function, but as I said in the previous video, we still have the parameters. The parameters are unknown: what $w_0$ and $w_1$ should I fit? Obviously, for different choices of $(w_0, w_1)$, even though in each case your hypothesis will look like a line, it is going to be a different line depending on which $(w_0, w_1)$ you fix.

So suppose somebody randomly gives some value of $(w_0, w_1)$: here is the original data and here is your hypothesis (ref time: 5:58), the model based on the parameters we chose. Intuitively it does not look like a very good fit. Another person gives us a slightly different model, and this looks slightly better than the first; again we have only an intuitive notion of what is better, which we will formalize in this very slide. All that has changed is $w_0$ and $w_1$; remember, all three are still lines. Now this one looks really good (ref time: 6:28), much better than the other two. So the question is: is there any way in which we can formalize or quantify why one fit is better than another? (Remember this word; in machine learning we are always looking at quantitative things, at numbers, because the machine only recognizes numbers.) The idea goes back to an old one: a cost function.

So what is a cost function for this? It is simply one number that tells you how good the fit is. How is it going to do that? For each point, say I take this x: there was a y and there was also a $\hat{y}$, so there is a difference between the two. What do I do? I take the difference between the two, square it, and then add over all points. Let us say this is $y^{(1)}$ and this is $\hat{y}^{(1)}$, this is $y^{(10)}$ and this is $\hat{y}^{(10)}$: for the same x you find the y and the corresponding $\hat{y}$, square the difference, and add all of them up. (How many examples do you have? You have i equal to 1 to m.) So with 51 such examples, we sum up all these squares; this is basically the sum of the squares of the errors, and we say that whichever line, whichever choice of $w_0$ and $w_1$, minimizes this total error, we are happy with it. Notice that no line is going to fit all of this perfectly; you cannot drive this J to 0, because no line will fit all points, but overall the line will sort of split the data so that it does not go too far away from any of it. So this is how we achieve our optimal w: we say that the optimal w (notice this has now become an optimization problem) is the one which minimizes this net cost function.

Now, a couple of things. The cost function is

$$J = \frac{1}{2m} \sum_{i=1}^{m} \left(y^{(i)} - \hat{y}^{(i)}\right)^2.$$

The factor $\frac{1}{2m}$ I have put here is arbitrary; even if you remove it, the minimum will be at the same w. But the $\frac{1}{m}$ is often used because you would like the mean of the squared error, for several reasons; one is of course that taking a mean avoids some kinds of overflow errors. The 2 is also arbitrary, but it is put there so that when you differentiate this function, the 2 coming from the derivative and this 2 cancel out. This fit is called the least squares fit, the ws that we get at the end of the process are called the least squares coefficients, and this cost function is sometimes called the least mean square (LMS) cost function, or the mean square cost function.
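Written out in MATLAB (a sketch; x and y are assumed to be the data vectors and w0, w1 the guessed parameters):

m = length(y);
yhat = w0 + w1*x;                  % hypothesis at every data point
J = sum((y - yhat).^2) / (2*m);    % the least mean square cost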

(Refer Slide Time: 09:20)

Now that we have reduced our fitting problem to an optimization problem, can we use gradient descent, which we discussed in the previous week? We used gradient descent as a general black box algorithm for finding minima, and it turns out we can use it here. How do we do it? A very simple idea again: you start with the x data (the temperatures, say), guess some w (remember, this w is the same for all x), and run it through the hypothesis; our $\hat{y}$ was the linear function $w_0 + w_1 x$. The ground truth is already available, and we just got a hypothesis because we guessed the w. Now there is going to be a gap between y and $\hat{y}$; square it and sum it, and that gives you the net cost of the coefficients you have chosen. Remember, this net cost arises because we chose some particular $w_0$ and $w_1$. Then find the gradient and improve your w using gradient descent. So let us see this again. You have m data points, say 51 as an example; for each of these data points you can get a corresponding $\hat{y}$, provided I give you some guess for $w_0$ and $w_1$. You have the corresponding output, and then you calculate the net cost function $J = \frac{1}{2m}\sum_{i=1}^{m}(y^{(i)} - \hat{y}^{(i)})^2$. Then you improve w using gradient descent. When do we stop? You keep doing this until some stopping criterion is satisfied (I gave you three different types of stopping criteria, and we will see at least two of them in an example shortly). The final values that you obtain for your w are actually your regression coefficients. So you can carry out this whole process; theoretically there is only one small catch: how do you calculate the gradient of J with respect to w, $\frac{\partial J}{\partial w}$? So let us see that.


(Refer Slide Time: 11:48)

So our J, as you saw on the last slide, is $\frac{1}{2m}\sum_i (y^{(i)} - \hat{y}^{(i)})^2$, and $\hat{y}$ was $w_0 + w_1 x$, so you get $\frac{1}{2m}\sum_i (y^{(i)} - w_0 - w_1 x^{(i)})^2$, and we want the gradient of J with respect to w. Notice that w is a vector; in our case $w = (w_0, w_1)$ (I will avoid the transposes). Similarly, the gradient of J with respect to w is nothing but the pair $\frac{\partial J}{\partial w_0}$ and $\frac{\partial J}{\partial w_1}$. Writing out both components of the vector update equation:

$$w_0 := w_0 - \alpha \frac{\partial J}{\partial w_0}, \qquad w_1 := w_1 - \alpha \frac{\partial J}{\partial w_1}.$$

Now we want these two expressions, $\frac{\partial J}{\partial w_0}$ and $\frac{\partial J}{\partial w_1}$. I will show you that they can be written in the compact form

$$\frac{\partial J}{\partial w_j} = -\frac{1}{m} \sum_{i=1}^{m} \left(y^{(i)} - \hat{y}^{(i)}\right) x_j^{(i)}, \qquad j = 0, 1, 2, \ldots, n,$$

where for our simple line $x_0^{(i)} = 1$ and $x_1^{(i)} = x^{(i)}$. Please do not pay too much attention to this formula before the derivation; the derivation is extremely straightforward, actually.

So let us take the case m equal to two. I will just do the derivation for that, and you can see that it easily extends to any m. So let us give a proof of this statement. Suppose I want $\frac{\partial J}{\partial w_0}$ for the case m = 2:

$$J = \frac{1}{2m}\left[\left(y^{(1)} - w_0 - w_1 x^{(1)}\right)^2 + \left(y^{(2)} - w_0 - w_1 x^{(2)}\right)^2\right]$$

$$\frac{\partial J}{\partial w_0} = \frac{1}{2m}\left[2\left(y^{(1)} - w_0 - w_1 x^{(1)}\right)(-1) + 2\left(y^{(2)} - w_0 - w_1 x^{(2)}\right)(-1)\right]$$

$$\frac{\partial J}{\partial w_0} = -\frac{1}{m}\left[\left(y^{(1)} - \hat{y}^{(1)}\right) + \left(y^{(2)} - \hat{y}^{(2)}\right)\right] = -\frac{1}{m}\sum_{i=1}^{m}\left(y^{(i)} - \hat{y}^{(i)}\right).$$

You can see easily that this continues for any m. Now what is the term $(y^{(i)} - \hat{y}^{(i)})$? It is nothing but the error. So what this tells us is that the first component of the gradient is simply the mean error (it is not the mean squared error, it is simply the mean error). Also notice that you can write $\hat{y} = w_0 + w_1 x = w_0 x^0 + w_1 x^1$, and we will usually call this $x^0$ term $x_0$. You will see the power of this notation later: because of it, in $\frac{\partial J}{\partial w_0}$ I can say the summand is multiplied by $x_0^{(i)}$, where $x_0$ is nothing but 1. It just lets me write the whole expression compactly.

Let us look at $\frac{\partial J}{\partial w_1}$. I will derive this in a slightly different fashion, just to give you another tool for your toolset:

$$\frac{\partial J}{\partial w_1} = \frac{1}{2m} \sum_{i=1}^{m} \frac{\partial}{\partial w_1}\left(y^{(i)} - \hat{y}^{(i)}\right)^2$$

$$= \frac{1}{2m} \sum_{i=1}^{m} 2\left(y^{(i)} - \hat{y}^{(i)}\right)\left(-\frac{\partial \hat{y}^{(i)}}{\partial w_1}\right)$$

$$= -\frac{1}{m} \sum_{i=1}^{m} \left(y^{(i)} - \hat{y}^{(i)}\right) x^{(i)},$$

since $\frac{\partial \hat{y}^{(i)}}{\partial w_1} = x^{(i)}$.
(Refer Slide Time: 20:04)

So what are the steps of the linear regression procedure? We first decide on our learning rate (remember, a learning rate is required for gradient descent), and we also decide on our stopping criterion. We make an initial guess for the weight vector, then systematically calculate the next iterate: now that you have one guess for $w_0$ and $w_1$, you make another guess using the formula that we just derived. Once you update your w, you evaluate your stopping criterion; if it is satisfied you stop, and if not, you go back, calculate once more, and keep stepping through the iterations.
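In outline, these steps look like the MATLAB sketch below (the actual code is shown in the next video; here x and y are assumed to be the data vectors, and the values of alpha and epsilon, and the change-in-cost stopping criterion, are illustrative choices):

alpha = 0.5;  epsilon = 1e-6;       % 1. learning rate and tolerance
w = randn(2,1);                     % 2. initial guess for [w0; w1]
Jold = inf;  err = inf;
while err > epsilon                 % 4. test the stopping criterion
    yhat = w(1) + w(2)*x;           % current hypothesis
    dJ = [-mean(y - yhat); -mean((y - yhat).*x)];  % formula derived above
    w = w - alpha*dJ;               % 3. gradient-descent update
    J = mean((y - w(1) - w(2)*x).^2)/2;            % cost after the update
    err = abs(J - Jold);  Jold = J; % change in cost between iterations
end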

So in the next video we will actually see a code implementing this, and hopefully all of these things will come together very nicely. This is another reason why we insist on doing the code: in theory you might understand something, but it is only when you actually see it practically implemented in code that things become clearer. As we declared earlier, we will be using an example through a MATLAB code; all of you are welcome to use whatever programming language you would like, but MATLAB is usually the easiest for explaining things as well as visualizing things nicely. So we will see that in the next video, thank you.

Machine Learning for Engineering and Science Applications
Professor Dr. Balaji Srinivasan
Department of Mechanical Engineering
Indian Institute of Technology, Madras
Coding Linear Regression

(Refer Slide Time: 0:14)

In the previous video we saw how to do linear regression using gradient descent. In this video I will show you a simple example of how to do that in an actual code. We will actually switch back and forth between two versions of the same code: one of them has more textual comments, and the other is pure code, as you would typically see it. In case some of you find the fonts a little difficult to read, you can also follow along with the code which has been given on the NPTEL website for week 4; you will find these codes on the NPTEL website.

(Refer Slide Time: 1:04)

So what we will be doing here is setting up a simple case for linear regression. We will create synthetic data (the previous example that I showed you was actual data of temperature versus $\alpha$, the thermal expansion coefficient, but in this case we will create some artificial x and y data). I will also tell you how we are going to create that data, just so that you can check how to do a linear regression with some simple data.

If you remember the process that we set out for the course, we first want to set the parameters: $\alpha$ here, for example, is the learning rate; it is a hyperparameter, as you might remember. We also want to set a stopping criterion, and I am going to use $\varepsilon$ for the stopping tolerance; we will look at various different stopping criteria later.

Now let us come to data generation. Data generation means generating synthetic data when we do not have real-life data; in this case we just want some randomly distributed points to which we try to fit a line. The way we are going to generate this data is: I create a formula myself, say $y = 15x + 5$, and then add some random noise to it. How this random noise is added, and what Gaussian noise means, etc., we will come to much later; this is just a simple method for us to create the data.

(Refer Slide Time: 2:38)

So let us now run this section by section, and you will see the form of the data that we have. On your screens you should be able to see some x, y data, and you can see that roughly it looks like a line will fit it well. Let me go back to the code here; this is the same code, without all the comments, that I showed you earlier. I am using what is called the debugger. You will see the same x and y data here, with all these randomly distributed points; ideally we should get a line through all of this.

So now we go step by step through our process. We initially take some random values of $w_0$ and $w_1$; remember, we are trying to model this as $w_0 + w_1 x$, that is our hypothesis function. So we will try to model this with some random $w_0$ and $w_1$. As we go through this, we will create some new variables: error here is our actual stopping quantity, and we would like our stopping criterion to be error less than $\varepsilon$. Now what are the various possibilities? I showed you three possibilities, and there are multiple others as well. For example, you can look at how much the cost function at this iteration differs from the cost at the previous iteration; if it converges, that is one form of error. Another form of error is the difference between the w at this iteration and the w at the previous iteration; remember, w is a vector, so instead of an absolute value we actually have to take a norm in order to find out how close the two values are to each other. In order to start the loop we give error some high initial value, in this case 1, and iter contains the iteration number.
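The two forms of error just mentioned, in MATLAB terms (a sketch; Jold and wold denote the values from the previous iteration, and the names are illustrative):

err1 = abs(J - Jold);    % change in the cost function
err2 = norm(w - wold);   % change in the weight vector; a norm is needed
                         % because w is a vector, not a scalar
% the loop runs while the chosen error is still greater than epsilon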

(Refer Slide Time: 4:36)

So let us come here and run our code till this point (ref time: 4:47). yh is our hypothesis function; the hypothesis function in this case is the simple linear function $w_0 + w_1 x$, and that is what this piece of code says. After you find this, you need to find the gradient. Since both $w_0$ and $w_1$ are free parameters, we take the gradient with respect to both, so the gradient of our loss function J has the two components we derived in the previous video:

$$\frac{\partial J}{\partial w_0} = -\frac{1}{m}\sum_{i=1}^{m}\left(y^{(i)} - \hat{y}^{(i)}\right), \qquad \frac{\partial J}{\partial w_1} = -\frac{1}{m}\sum_{i=1}^{m}\left(y^{(i)} - \hat{y}^{(i)}\right) x^{(i)}.$$

That is written here in code form. So what we do is initialize some $w_0$ and $w_1$ and find the corresponding hypothesis for all points. This is the ground truth y shown here, and now you are going to have a line, the line that you are going to fit, based on the guessed values of $w_0$ and $w_1$. If you see on your screen, the guessed $w_0$ value is 0.16 and the guessed $w_1$ value is 0.52; this is just some random guess that the machine has thrown at us.

Let us move a little further. If we come till here (ref time: 6:12), we now have hugely updated values: $w_0$ is 18.25 and $w_1$ is 11.44. Based on this update we get an updated value of the cost function: the cost function at the 0th iteration was approximately 82, and the cost has now decreased to approximately 66. So now let us plot what our function actually looks like.

(Refer Slide Time: 6:54)

So you see the original data, all the dots, all the circles, and our linear fit is actually somewhere here; it is not at all fitting the original data. This is the situation we find ourselves in after one iteration. You will see that the old J was 80-point-something and the new J is approximately 66; it is decreasing, but it is still not good enough. There is a lot of difference between the data and our fit, which is reflected in the value of J as well: the value of J that we have is high.

So let us run this once more. Now you find that the fit has jumped from here a little bit further; we can also check the values of $w_0$ and $w_1$ that we have got: now $w_0$ is 1-point-something and $w_1$ is 3-point-something, and that is the guess you get. We are getting this due to the update we have for w according to our gradient descent formula, which is sitting here.

(Refer Slide Time: 7:55)

Now let us continue once more, and you see that J is uniformly decreasing. A warning: this will not necessarily happen in every practical code that you write; it just happens to be true for this example that J is continuously decreasing. You see again that the fit has improved a little bit; maybe you might not agree that it has improved, but you can see that the line is here, our ideal line would be somewhere there, and J is constantly decreasing.

So now let us observe what happens as you keep going. You will notice that what I am printing out here is the difference between the J at this iteration and the J at the previous iteration, and I have asked the code to stop as soon as this drops below $10^{-6}$. You will see that after approximately 72 or 73 iterations we have got a very good fit. This data fit is actually pretty good, and we can now look at our final values: $w_0$ is 5.04 and $w_1$ is 14.9; these are the values that we have obtained after these many iterations.

So linear regression has worked. You can change your error criterion from the difference between two Js to the difference between the w at this iteration and the w at the previous iteration, and we can run that too. You will see that the number of iterations is slightly different. You will also observe that visually the graph does not change very much, as J has already reached very low values, but it depends on how much tolerance you have for J. After around 150 to 200 iterations w will stop, at some value of $w_0$ and some other value of $w_1$.

Okay, so you can use any of the stopping criteria you want; some people, instead of an error criterion, actually put a maximum number of iterations, whatever is computationally feasible.

(Refer Slide Time: 10:35)

So, just to show you the same thing in the other version of the code: remember, all we have done is the updates $w_0 := w_0 - \alpha \frac{\partial J}{\partial w_0}$ and $w_1 := w_1 - \alpha \frac{\partial J}{\partial w_1}$, plus all these plots. You can take a look at both these versions. This particular video showed you a simple example of how to use linear regression to fit some random data. There is one thing we have not yet discussed: here I have just fit a line, but in the previous example I showed you a linear fit, a quadratic fit, and a cubic fit.

So can you do a linear, a quadratic, and a cubic? It turns out that you can, with a small, minor modification, which we will see in the next video. Thank you.

Machine Learning for Engineering and Science Applications
Professor Dr. Balaji Srinivasan
Department of Mechanical Engineering
Indian Institute of Technology, Madras
Generalized Functions for Linear Regression

(Refer Slide Time: 00:15)

Welcome back. In the last video I showed you how to write a generalized linear regression routine: we discussed how to take x as a vector, with a scalar y as output, and how to map one to the other using a linear model. So in this short video I would like to show you how to code this up. I am going to just briefly go over it, leaving the responsibility of reading the code and understanding it to you for the most part. This is a simple MATLAB routine which will be available to you on the NPTEL website in the week 4 portion.

So what I have written here is a function, which we call generalized linear regression; a simpler version of this without all the comments is available in lin model, and it is exactly the same as this code. Both these codes are available to you on the NPTEL website. Once again I repeat: you are welcome to write some such thing for yourself in any other language that you are comfortable with; this is just for your understanding. So what we have here is a regression model which takes in x, the input, and y, the corresponding output vector. Remember, you have a whole bunch of examples, which is why I have written "input matrix" here: the x for each example is itself a vector.

As we discussed in the previous video, x could itself have components $x_1, x_2, x_3$, etc., depending on the number of features or attributes that you have. $\alpha$ is the learning rate that you think will be good for this problem, and $\varepsilon$ is the stopping tolerance for the problem.

(Refer Slide Time: 02:05)

The output which this function gives is the whole weight vector. Notice here, as we discussed in the last video, that I am including the bias term $w_0$, which some people call b, in the weight vector; therefore the size of the weight vector has to be $n+1$: n features plus one bias. The first thing that we do, since we have not given the sizes explicitly (this is your choice), is determine the size of the incoming data; MATLAB has an easy way for you to do this, which might or might not be available in other programming languages, and we have used this feature. So we find the number of examples m and the number of features n simply by looking at the size of x.

(Refer Slide Time: 02:50)

We also make an initial guess; notice that $n+1$ is the size of the vector, because you include $w_0$ also. The rest of this is all the same as before.

(Refer Slide Time: 02:56)

You then iterate for w using gradient descent. Notice now that the hypothesis function is not simply $w_0 + w_1 x$, but $w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_n x_n$, which can be written as $w^T x$ in case w includes $w_0$ and we assume that $x_0 = 1$. So we find the hypothesis function; you will notice a 1 sitting here because I am writing $x_0$ explicitly, and x is simply $x_1$ to $x_n$.

There are many ways of writing it; I have left that as an exercise to you. I have written a slightly inefficient version, but you can write more efficient versions. Once again, as before, you have several choices of stopping criterion: either the difference between the current value of the loss function and its previous value, or how much the current w differs from the previous w.

(Refer Slide Time: 04:02)

In either case we simply call this stopping quantity "error", and we keep running until it is no longer greater than $\varepsilon$.

(Refer Slide Time: 04:12)

Our main task, of course, is to find the gradient of J with respect to the weight vector w. I have written the formulation here (we also derived it twice in the previous videos), and you will notice it is the prediction error multiplied by the corresponding feature component:

$$\frac{\partial J}{\partial w_j} = -\frac{1}{m} \sum_{i=1}^{m} \left(y^{(i)} - \hat{y}^{(i)}\right) x_j^{(i)},$$

which is what I have written here. DJ denotes $\frac{\partial J}{\partial w}$; notice that DJ has $n+1$ components, starting from the first component, which corresponds to $w_0$, and going up to component $n+1$, which corresponds to $w_n$.

(Refer Slide Time: 04:58)

So we take that and calculate the change in w (w is of course updated as $w := w - \alpha \frac{\partial J}{\partial w}$); finally we calculate the hypothesis, and this lets us calculate the cost

$$J = \frac{1}{2m} \sum_{i=1}^{m} \left(y^{(i)} - \hat{y}^{(i)}\right)^2.$$

(Refer Slide Time: 05:20)

I have also included some plots, just for similarity with before, and we can try running it to see how it performs. Now, the important thing here is that this is really general: you can take any number of examples m, and you can also choose any number of features n, and the code is supposed to work. As we discussed in the previous video, this lets us do not only linear regression but also polynomial regression, because all we need to do is supply the powers as features: $w_1 x^1 + w_2 x^2 + \cdots + w_n x^n$. So if the incoming vector is basically $[x^1, x^2, x^3, \ldots, x^n]$, you can simply use those powers as the features of the polynomial, and that will work.
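A condensed sketch of such a routine is given below; the actual NPTEL code has more comments and plotting, so the function name genLinReg and the structure here are an illustrative reconstruction, not the exact file:

function w = genLinReg(X, y, alpha, epsilon)
% Generalized linear regression by gradient descent.
% X is the m-by-n input matrix (one example per row), y the m-by-1 output.
[m, n] = size(X);
Xb = [ones(m,1), X];          % prepend x0 = 1 so that w(1) is the bias w0
w = randn(n+1, 1);            % initial guess, n features plus one bias
err = inf;  Jold = inf;
while err > epsilon
    yhat = Xb*w;                       % hypothesis w0 + w1*x1 + ... + wn*xn
    dJ = -(Xb'*(y - yhat))/m;          % dJ/dw_j = -(1/m) sum (y - yhat)*x_j
    w = w - alpha*dJ;                  % gradient-descent update
    J = sum((y - Xb*w).^2)/(2*m);      % current cost
    err = abs(J - Jold);  Jold = J;    % stopping criterion: change in cost
end
end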

(Refer Slide Time: 06:11)

So let us now go back to our original data and see if we can use our generalized linear regression code to make linear, quadratic, and cubic predictions. You will see one small surprise here, which will lead us to a small modification of what we do for linear regression. As before, T is the input data and $\alpha$ is y (notice that $\alpha$ carries a factor of $10^{-6}$ as the expansion coefficient). So initially we define x as T and y as $\alpha$, we choose a learning rate of 1, we choose an $\varepsilon$ of $10^{-5}$, and we try running our generalized linear regression code and see what happens. When we run it, please notice what is happening here: you can see these numbers growing larger and larger. In fact J is rapidly increasing; J has now reached $10^{247}$ and the prediction has reached $10^{128}$, because we are not getting convergence, we are actually getting divergence.

Now you might decide that this is because of a large learning rate and try to reduce it. So let us say we make the learning rate 0.1 instead of 1 and try running it again: you will see that the situation has not really improved; it is still blowing up, with high values of J and high values of $\hat{y}$. Now why does this happen? It happens because the T, the x, that you have here goes from 80 down to $-340$, while our ys are of the order of $10^{-5}$ times those values, so the coefficients needed to map this x to this y are extremely small. All these problems are essentially what are called normalization problems. For example, say you are mapping the area of a house to its price (an example that I have used before): in what units will you give the area? You could give the area in square feet, which is the normal thing, or in square meters, or in square kilometers, in which case your input vector will look really small, or in square centimeters, etc. Similarly, suppose you are mapping the height of a person to his or her weight, and that is the regression problem we wish to do: in what unit should you represent height? Should we represent it in micrometers, in meters (which seems reasonable to us), or in feet, etc.? If you represent it in kilometers, your input vector will look really small and your weights will change correspondingly. So we tend to use units in which x is normalized so that it typically varies on the order of 1. That turns out to be the simplest thing to do; you can make it vary from, say, $-5$ to $5$, or from 0 to 1, which is probably the easiest. The choice that we will make right now, and this step is called normalization or re-scaling, is to rescale the data so that I have a new x. I will also rescale y (remember, we had all these numbers and I was multiplying by $10^{-6}$, which is arbitrary as far as the code is concerned; I will comment that out so that y is now simply these numbers). x is rescaled as

$$x := \frac{x - x_{min}}{x_{max} - x_{min}}.$$

When you do this, x gets rescaled so that it lies only between 0 and 1. Now that we have done the rescaling, let us see what x looks like: we have the new x here, and if I print it out you will notice that it goes between 0 and 1. You can see that the maximum is 1 and the minimum is 0; this corresponds to the original x, which went from 80 to $-340$. All we have done is rescale x: the minimum has been subtracted out and the result divided by the range, so now x goes from 0 to 1. Now we can try to see what happens when we continue our code.

(Refer Slide Time: 11:13)

For the same learning rate you will now see that the linear fit starts working, so there is a drastic difference. When we did not rescale, things went completely wrong because the values of x were high and the corresponding values of $\hat{y}$ were also high. So you can see that certain things can actually make a great difference as far as training goes (training means finding out the coefficients). You will see that the fit is now working, and the only change we made was rescaling the data. This trick is an important one; in fact it has been generalized to something really big called batch norm, which we will see later on in the course. So I will stop this simulation.

Another thing we can do with the same code is a quadratic fit. Notice this: we keep the same $x_n$ and the same $y_n$, but now I change my feature vector and say that the input is not only x but x as well as $x^2$. Our code is written such that the moment I give it one extra feature, it starts reading more features and fits a bigger model. So this is the trick that we use: the moment I give x and $x^2$, the code obviously does not know that you have given $x^2$ as the second variable; all it knows is $x_1$ and $x_2$. The moment it sees $x_1$ and $x_2$, it will say that my model is no longer $w_0 + w_1 x_1$ but $w_0 + w_1 x_1 + w_2 x_2$, which serves our purpose, because $x_2$ is in fact $x^2$. So let us see that now; just to show you what happens, I will run this again. If you come here, you will see that the number of features the code is recognizing is two, and you will also notice that w is now a $3 \times 1$ vector, with an initial guess of 0.8 for $w_0$, 0.14 for $w_1$, and 0.42 for $w_2$; this serves our purpose. In compact form the trick is just the sketch below.
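This sketch reuses the genLinReg reconstruction from earlier (an assumption, not the exact NPTEL code); xn and yn are the rescaled data:

Xquad  = [xn, xn.^2];              % two features: x and x squared
Xcubic = [xn, xn.^2, xn.^3];       % three features for the cubic fit
w_quad  = genLinReg(Xquad,  yn, 1, 1e-5);   % fits w0 + w1*x + w2*x^2
w_cubic = genLinReg(Xcubic, yn, 1, 1e-5);   % fits up to w3*x^3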

So we continue, and you will see that it is now trying to fit a curved, quadratic line to this data. I will let you run this on your own, and it is not very hard to change this into a cubic fit, because all you need to do is add the extra term $w_3 x^3$. We can now start from scratch and run this, and it will fit a cubic curve; you can see that this is slightly more curved. I would encourage you to play around with this code, or write one of your own, and see what sort of fits you can get for this kind of data; you can of course try it on any data. So what we have seen in this video is that rescaling helps you, and that with the same code you can fit a linear, a quadratic, or a higher polynomial model, depending on what sort of input vector you give it. I will write down what is needed to do this once more.

(Refer Slide Time: 14:46)

So if we look at normalization or rescaling, what we did was

$$x := \frac{x - x_{min}}{x_{max} - x_{min}},$$

and let us call this x for our purposes; this is our new input vector. What this does, of course, is make x vary between 0 and 1. There is an alternative re-scaling,

$$x := \frac{x - \mu}{\sigma},$$

where $\mu$ is the mean of the data and $\sigma$ is the standard deviation of the data. Typically this second one is called normalization; it does not ensure that you lie between 0 and 1, and the result will usually lie between about $-3$ and $+3$ or somewhere in that range. The first one is simple rescaling.
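Both versions in MATLAB (a sketch; x is the raw input vector):

xr = (x - min(x)) / (max(x) - min(x));  % rescaling: xr lies in [0, 1]
xs = (x - mean(x)) / std(x);            % normalization: zero mean and
                                        % unit standard deviation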

So you can use one or the other. Normalization is used to great effect in something called batch norm, which is very effectively used in several deep neural networks; you will see this a little bit later.

Machine Learning for Engineering and Science Applications
Professor Dr. Balaji Srinivasan
Department of Mechanical Engineering
Indian Institute of Technology, Madras
Goodness of Fit

(Refer Slide Time: 00:15)

Welcome back. We have been seeing various types of fits for the same data: for the same data we saw a linear fit, a quadratic fit, and a cubic fit. Intuitively we could see that the cubic fit is better in this case than the quadratic or linear fit, but sometimes it is not really visually obvious, especially if you are dealing with high dimensions. For example, if I have this line versus this line, which one is a better fit? So that is one question. Immediately we might say that obviously I should look at J to find out how good a fit is, and my J in that case was

$$J = \frac{1}{2m} \sum_{i=1}^{m} \left(y^{(i)} - \hat{y}^{(i)}\right)^2,$$

basically the mean squared error. So this is one measure of how good the fit is. Sometimes this is not a good enough measure, for multiple reasons; sometimes we just get a large value of J and we do not know whether this indicates a good fit or not. Typically we would like one number which lies between 0 and 1, where we can say something like: 0 is a really bad fit and 1 is a very good fit. So we want to normalize this; this kind of thing will repeat again and again.

You have a number, and you would like to non-dimensionalize it: normalize it with respect to some denominator, so that you get a value between 0 and 1.

So we will try to do that; a measure for this is something called $R^2$. Where we are going is: $R^2$ will lie between 0 and 1, where basically 0 means really bad and 1 means great. For this we need three different statistical measures of the variance in the data, so let us look at these three measures. Say once again that this is our original data and I have my hypothesis function $h(x)$. The ground truth at any particular point is y, but what I am predicting is $\hat{y}$. So, at the point $x^{(i)}$, the difference between $\hat{y}^{(i)}$ and $y^{(i)}$ is the prediction error. Also, the original data, scattered here and there, has some variance. Remember, from our probability week, the definition of variance; the corresponding sum here is

$$SST = \sum_{i=1}^{m} \left(y^{(i)} - \bar{y}\right)^2,$$

where $\bar{y}$ is the mean of the data. This term is known by various terminologies: SST, where the first S stands for sum, the second S stands for square, and T stands for total; the sum square total, or basically the total variance. What does total variance mean? Before we even had a model there was some amount of variation in the data, and this term calculates the total amount of variance in the data before we even had a model. Now let us look at the second measure of error,

$$SSE = \sum_{i=1}^{m} \left(y^{(i)} - \hat{y}^{(i)}\right)^2.$$

This is our error in prediction, and it is known by the term SSE, where E stands for error. The meaning of this term is fairly clear: it is simply the accumulated squared difference between $y^{(i)}$, the ground truth, and $\hat{y}^{(i)}$, our prediction, our hypothesis, our model.
443
(Refer Slide Time: 05:44)

So that term is known as SSE. There is a third measure,

$$SSR = \sum_{i=1}^{m} \left(\hat{y}^{(i)} - \bar{y}\right)^2.$$

I will name this term SSR and then explain what it physically represents: once again SS is sum squared, and R stands for regression. Now what does this denote? Just as the variance told you how much the $y^{(i)}$ vary about $\bar{y}$, this term, the sum square regression, tells you how much your predictions $\hat{y}^{(i)}$ vary about $\bar{y}$: it is the variance in the prediction. Without going into much statistical detail, this is called the amount of variance captured by the model, while the first term, SST, is the amount of variance present in the data. $R^2$ is defined as SSR divided by SST; physically this means the amount of variance explained or captured by the model divided by the amount of variance present in the data, and it can be shown that $R^2$ will always be between 0 and 1.

In the best case scenario your model captures all the variance which is actually present in the data, and in that case $R^2$ will be equal to 1. Now it turns out that there is a nice relationship between SSR, SSE, and SST, which I am going to state without proof:

$$SST = SSE + SSR.$$

You can try the proof as an exercise; it is a slightly tricky proof, I must mention, in case you wish to try it out for your own edification. Obviously you can use this to see that $SSR = SST - SSE$, which gives you

$$R^2 = 1 - \frac{SSE}{SST}.$$

This is usually the form in which it is implemented, because we are anyway calculating SSE, the sum of squared errors, which is really just the unscaled part of J.
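Computed in MATLAB (a sketch; y is assumed to be the ground truth vector and yhat the model's predictions):

SST = sum((y - mean(y)).^2);   % total variance present in the data
SSE = sum((y - yhat).^2);      % error in prediction
R2  = 1 - SSE/SST;             % fraction of the variance captured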

(Refer Slide Time: 09:34)

So what we have seen in this short video is the use of $R^2$. I should not really use the term "goodness of fit", because that has several technical meanings, but at least one number which is often used is this $R^2$. In fact, if you use many of MATLAB's in-built routines, they will tell you, after you fit a particular model, even a neural network model (because this is really general), that your model's $R^2$ is so much. If you get something like 0.95 or 0.96, that is a very good model, because a lot of the variation in the data has been captured by your model, and occasionally we will be talking about $R^2$ for various problems in this course. Thank you.

Machine Learning for Engineering and Science Applications
Professor Dr Ganapathy Krishnamurthi
Department of Engineering Design
Indian Institute of Technology, Madras
Bias-Variance Trade-off

(Refer Slide Time: 0:19)

Hello and welcome back. In this video we will talk about the bias-variance trade-off, which is an important consideration when you train different machine learning algorithms. Most of the examples and illustrations are taken from the textbook by Christopher Bishop, Pattern Recognition and Machine Learning, and we will use those illustrations and material to understand the bias-variance trade-off. So let us consider this simple example, where we are shown this green curve, a sinusoidal curve, from which we draw some data points at specific intervals (in this case shown right on top here) and add noise to them to obtain the data set given by the blue circles.

(Refer Slide Time: 1:22)

So the idea here is: given the data set consisting of these blue circles, we want to fit it with a polynomial of the form given here. What we are going to do is vary the degree of the polynomial and see what the fits look like. How do we fit them? We use an error term to perform the regression, and in this case it is just the least squared error, which is illustrated in this figure: t is the ground truth, the correct answer, and y is whatever your polynomial outputs, so the error you are going to minimize is

$$E(w) = \sum_{i=1}^{m} \left(t^{(i)} - y(x^{(i)}; w)\right)^2.$$

Given this error, we are going to fit the given data set with polynomials of varying degree, and I will show what each of them looks like. So first we have a zero degree polynomial, which is nothing but a constant term; as you can see, there is a red line, which is the fitted curve, and it of course does not match the data that we have used.
448
(Refer Slide Time: 2:42)

Again we use a polynomial of degree 1, which is nothing but a linear fit, and once you start to
go, get higher degree polynomial, in this case polynomial of degree 3 here and so on and so
forth, till we come to the polynomial of degree 9. In this case you see that the red curve,
which is the fit, that is the output y ( x(i ) ; w) fix goes through every one of these points. So it
goes through all the blue circles. However in between the blue circles you see that it is off,
what do you mean, what do I mean by this off, so we know that our new circles are drawn
from this green curve.

So ideally, when you are done with the fit, you would expect the red curve to lie close to the green curve, but in this case, in between the samples it is actually off. So even though with the higher degree polynomial we are able to fit every point exactly, so that our fitting error is very small, we see that at points other than the blue circles the fit is actually quite far from the ground truth.
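This behaviour is easy to reproduce. Here is a minimal NumPy sketch of the experiment just described; the sample size, noise level, and variable names are illustrative choices rather than the exact settings from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

# 10 noisy samples drawn from the green sinusoidal curve
x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(10)

for degree in [0, 1, 3, 9]:
    w = np.polyfit(x, t, degree)         # least-squares polynomial fit
    train_err = np.sum((t - np.polyval(w, x)) ** 2)
    # probe the fit in between the training points
    x_dense = np.linspace(0, 1, 101)
    test_err = np.sum((np.sin(2 * np.pi * x_dense) - np.polyval(w, x_dense)) ** 2)
    print(degree, train_err, test_err)
# degree 9 drives the training error to ~0 while the error
# away from the blue circles blows up
```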

(Refer Slide Time: 3:32)

So, if we actually plot the error as defined in the previous slide for different values of M, the degree of the polynomial, we see that as we move to the higher degree polynomials, the error on the training data is very small, which is what the blue circles indicate. However, the test data error starts to diverge. This phenomenon is referred to as overfitting. Similarly, coming back to this end, the error is again quite high when we are using a polynomial of degree zero, that is, when we are just fitting a constant function. So in both cases we have a fairly large error: at this end of the spectrum we call it under fitting, and at that end of the spectrum we call it over fitting. So, just to summarise, we will look at similar plots together just to get an idea of what is going on. We have the selection of these blue points, which we try to fit to polynomials of different degrees.

(Refer Slide Time: 5:43)

And we do that by estimating the parameters; in this case the ws are the parameters. Here it is a linear fit, so there is only one parameter w₁, plus w₀, which is not shown in the figure. So what we are seeing here is: we have this one model, which is a polynomial of degree 1, this is a polynomial of degree 2 because it has 2 more terms, x and x², and this is a polynomial of degree 9. As you can see, the polynomial of degree 2 seems to fit properly, in the sense that it goes through all the blue points, and when we visualize the green curve superimposed on it, it is actually close to the green curve also.

In contrast, when you use the higher degree polynomial, it again fits through the blue points correctly, but in between the blue points, where there is more data which is not shown here, you see that the fitted curve is actually off. So the best fit here is obtained with the polynomial of degree 2, at least among these 3 models that we see here. This is referred to as a model with appropriate capacity, where by capacity you can basically mean the number of parameters in your model.

(Refer Slide Time: 6:18)

So if you have a polynomial of degree 9, you have 9 weights plus w₀, that is, 10 parameters. Given this situation, what the bias-variance trade-off gives us is an idea of how to figure out the appropriate capacity for a given problem. So once again we look at this plot, which shows the capacity on the x-axis and the error, that is, the training and testing error for a given model, on the y-axis. Just for ease of understanding, you can think of capacity as the degree of the polynomial.

So we can see that as you increase the capacity beyond a certain point, the training error does go down, it becomes much smaller. However, the green curve shows the generalisation error; the generalisation error is the error on testing data that was not part of the training data. The blue circles in the earlier plots were the data used to train our models; suppose we choose some other points which do not coincide with the blue circles and give them as input to our model, the output is actually quite far from the ground truth, which leads to a very high error.

At the other end of the spectrum, we once again see that both the training and the generalisation error are quite high. The plot can be a little bit misleading, the two curves seem to overlap there, but all you have to understand is that the error is very high, so that is not desirable either. So, somewhere in between is the optimal capacity, which in the case of our example is a polynomial of degree 2 or 3, which seems to give the best fit: it goes through all the blue points, or very close to them, and it is also close to the green curve, which is our ground truth.

So by altering the capacity we can make the model under fit or over fit. Under fitting is when you have a very large error in terms of the accuracy of the model: the result that you get is far away from the ground truth. And here, this is overfitting, wherein the model fits the training data perfectly but does not generalise very well, so new data gives a very large error. Just to summarise this, here is the usual way in which bias and variance are visualised.

(Refer Slide Time: 8:15)

So look at this model here; this is the equivalent of the high bias model. This is the dartboard example: we would like all the darts to hit the bull's-eye, which is the centre. But all the darts have actually hit quite far away from the bull's-eye, that is, they are not accurate, so it is high bias; but they are all very close together, okay, so that is low variance. If we look at the case of high bias and high variance, once again most of the darts have fallen far away from the bull's-eye, but now they are also not close together, they are highly dispersed, so that is high bias and high variance.

(Refer Slide Time: 9:11)

The ideal model, which is what we would like, has low bias and low variance, that is, most of the darts are close to the bull's-eye, in and around it, with very small dispersion. The other case is the low bias and high variance model, wherein the darts are centred on the bull's-eye on average but are far away from each other, okay. So, this bias-variance trade-off basically relates to model complexity, which crudely we can think of in terms of the number of parameters and the basis functions. When we say basis functions, for instance in the example we saw, the basis functions are nothing but x, x², ..., x^M, where M indicates the model complexity. If we have a lot more parameters, in general it is a more complex model. The error that we get in training, which you can think of as the fitting error or the training error, can be decomposed into 2 components: one is the bias squared error and the other is the variance.

And these depend on the model complexity: the contribution of the bias to the error and the contribution of the variance to the error both depend on the model complexity. We will not actually go through the derivation here; you can start with the least squared error and decompose it into 2 terms, the bias and the variance. But we will just look at the terms themselves to understand what they mean.

(Refer Slide Time: 10:45)

So consider a dataset; we have seen that y is our model, the ws are the parameters of our model, x is the input, and you can think of capital Y as the ground truth or the correct answer that we would like to get. Ŷ is the statistical estimate; we will see later why we call it a statistical estimate, for now we can think of it as the prediction made by your model, the output of your model. The bias is the expectation value of the difference between the model prediction and the correct value. We all know what an expectation value is, but here I am only talking about single values; I will clarify later what we mean by the expectation. So you can define the bias through the term Bias² = (E[Ŷ] − Y)². The variance is the variance of your prediction itself, given by Variance = E[(Ŷ − E[Ŷ])²]. So you will have a range of predictions, and the variance among them is what we call the variance, as the name implies. The fitting error that we consider can then be written as the sum of these 2 terms. There is also one more term, the noise, which is inherent in your data, because even your ground truth has some errors in it, there is always some noise; this we are not taking into account in this model.

(Refer Slide Time: 12:16)

So the error that you get is the sum of these two. If we look at the plot again, with model complexity along the x-axis and the actual fitting error along the y-axis, we see that as the model complexity increases, the bias component of the error comes down, which we saw earlier: when we looked at the polynomial of degree 9, the curve that we finally got was able to go through all the blue circles, that is, the data points that we used. Similarly, when you have a very simple model, in our case the polynomial of degree zero, the bias contribution to the error becomes very high. And if you look at the variance component, as you increase the complexity of the model you get higher variance, and as you decrease it, the variance decreases. Okay, so these 2 add up to give you the total error, and there is a point at which you have the optimum error; the right capacity model strikes the right compromise between the bias and the variance.

To understand it from the point of view of curve fitting: suppose these red points are data points that we have drawn from some complex function, and we are trying to fit them using a simple model, in this case a straight-line fit. A simple model has an insufficient number of parameters and features, so it will have higher bias. On the other hand, if you use a more complex model to fit the same data, which is what we see on the right, then it has a lot more features than you actually need. It will have very low bias, in the sense that it will actually be able to fit through all the data points, but the variance will be very large; the simple model, on the other hand, will have lower variance. So this is how you understand it from the point of view of curve fitting.

(Refer Slide Time: 14:17)

Now, crudely speaking, how do we measure this bias and variance? We will not do it this way in practice, but just to understand what these terms mean, we will go through this process. Some of these terms may not be familiar to you, but I will try to make the understanding easier. We will consider B bootstrap samples, or variants, of the dataset X. Think of it as if we had access to, let us say, 10 or 20 different datasets drawn from the same data distribution. Going back to our example, we had this green sinusoidal curve: you sample 10 points at a time from the sinusoidal curve, and then you do that, let us say, 10 or 20 times, okay. So you will have about 10 to 20 datasets, each dataset having 10 points; that is what we call bootstrapping. So we have B bootstrap variants from our data, with corresponding X and y. For each bootstrap set, we call the data that we have extracted the training set, and for each of them we will have a corresponding test set also. The way to think about it is: you have X and y, you will have one set of xᵢ, this is one dataset that you draw from our curve, let us say the green sinusoidal curve that we saw, and then you draw a different set of X and Y, again xᵢ and the corresponding yᵢ. So you will have in this case B datasets; B can be 10, 20, 100, whatever you like. Each one of these datasets we fit to a model, which in our case can be, for example, a polynomial of degree M; of course you fit all of them to the same polynomial, and then you test each on its separate held-out dataset, each of them having a different test set.

Now that we have B different datasets, for every model that we use, let us say a model of degree 2, a polynomial of degree 2, for every X we will have many predictions, y₁, y₂, up to the number of datasets we have. So remember that after we fit a model, we can evaluate that model for any X. Since there are B models, we fit B models to the B bootstrap variants, the B samples or B datasets that we have.

We can then evaluate them at a single X and get B Ys: for one X, since you have B models, you can have up to B outputs. The variance is nothing but the variance of those outputs; that is the variance of our model. The bias is nothing but the average: for every X you can calculate the average of the Ys that you get, subtract the ground truth, and take the square; that is the bias squared. So this is how you could actually calculate the bias and variance for the model you choose. Just to summarise: we have B datasets, all coming from the same distribution, and to each one of those datasets you fit the same model. In this case you decide to fit a model with a polynomial of degree 2, and correspondingly you get the model parameters. Now that you have the model parameters for each one of these B models, you plug in individual Xs, so for each X you get B Ys. The variance of those Ys is the variance of your model; the mean of those Ys minus the ground truth, squared, is the bias. Of course you realise that doing this on a real dataset, especially when you have a large model, is not going to be feasible, especially since we will not have access to so many versions of our dataset. So how do we actually do it in a real scenario?
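For intuition, the whole procedure fits in a few lines of NumPy. This is a minimal sketch of the bootstrap recipe just described; B, the noise level, and the probe point are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
B, n = 100, 10                   # B bootstrap datasets, n points each
degree = 2                       # the model whose bias/variance we probe
x_eval = 0.3                     # a single input X at which we evaluate

preds = np.empty(B)
for b in range(B):
    x = rng.uniform(0, 1, n)     # one dataset drawn from the green curve
    t = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(n)
    w = np.polyfit(x, t, degree) # fit the same model to every dataset
    preds[b] = np.polyval(w, x_eval)

ground_truth = np.sin(2 * np.pi * x_eval)
bias_sq = (preds.mean() - ground_truth) ** 2  # (mean prediction - truth)^2
variance = preds.var()                        # spread of the B predictions
print(bias_sq, variance)
```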

(Refer Slide Time: 18:23)

One of the prescribed methods is to split your existing data into a training set, which can be 60 to 70 percent; something called a validation or development set, which is what you will monitor in order to determine whether your model has high bias or high variance; and a testing set at the end, so that once you figure out the correct model using the validation dataset, you can test whether it is as good as you think it is. You then plot, against model capacity, both the training error and the validation error for different model complexities. If the validation error and the training error are both high, as in the plot we saw a few slides ago, it means that it is a high bias problem, that your model has high bias. On the other hand, if your training error is very low but your validation error is very high, then you have a high variance problem.

That is what happens in this case: your training error became very low but your validation error is very high, okay. So, just to recap, the idea is to split your data into 3 parts. One is the training data, which you will use to figure out the parameters of your model; the validation data is the one you will use to check whether your model has high bias or high variance. If your model has very high training and validation error, then the model has high bias. If your training error is very low while your validation error is very high, then you have a very high variance model, which means you have a very complex model that might not be necessary for the data you have at your disposal.
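A minimal sketch of this diagnosis, again in NumPy; the 60/20/20 proportions follow the lecture, everything else (data, degrees, seed) is an illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 100)
t = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(100)

# 60/20/20 split into training, validation and test indices
idx = rng.permutation(100)
tr, va = idx[:60], idx[60:80]   # idx[80:] would be the held-out test set

for degree in [0, 1, 3, 9]:
    w = np.polyfit(x[tr], t[tr], degree)
    tr_err = np.mean((t[tr] - np.polyval(w, x[tr])) ** 2)
    va_err = np.mean((t[va] - np.polyval(w, x[va])) ** 2)
    # both errors high -> high bias; low training but high
    # validation error -> high variance
    print(degree, tr_err, va_err)
```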

(Refer Slide Time: 20:33)

Okay, so to summarise: how do we handle this problem? We have a very complex model, which can lead to overfitting, but you would like to play it safe if you actually want to retain the complex model because it has the capacity to generalise very well. So if you want to keep it, how do you address the high variance problem, how do you make sure that your fits are good? This is accomplished using regularisation.

So what is regularisation? In this case we take the least square error and add a penalty, L = ( Y − ∑_{i=1}^{p} wᵢxᵢ )² + (λ/2)||w||². So what we do is penalise large coefficients by adding a term to the fitting error, in this case (λ/2)||w||², where ||w|| is the 2-norm, the L2 norm that you must have seen. You add the L2 norm to your fitting error and then you do the fit as before. What this does is that if there are very large coefficients, this term will penalise them.
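As a concrete sketch, for a linear (or polynomial-feature) model this penalised least squares problem even has a closed-form solution; the following NumPy code is an illustrative implementation under the (λ/2)||w||² convention used above, with variable names of our own choosing:

```python
import numpy as np

def ridge_fit(X, y, lam):
    # Minimise ||y - X w||^2 + (lam/2) ||w||^2.
    # Setting the gradient to zero gives the regularised normal
    # equations: (X^T X + (lam/2) I) w = X^T y.
    A = X.T @ X + 0.5 * lam * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ y)

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(10)
X = np.vander(x, 10, increasing=True)  # degree-9 features 1, x, ..., x^9

w_unreg = ridge_fit(X, t, 0.0)   # no regularisation: huge coefficients
w_reg = ridge_fit(X, t, 1e-3)    # mild regularisation tames them
print(np.abs(w_unreg).max(), np.abs(w_reg).max())
```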

(Refer Slide Time: 22:00)

So, why do we have to penalise very large coefficients? We will show why that is required, and we will look at what the curves look like after the regularisation. If we go back to the polynomial example, where we are now using M = 9, we saw earlier that there was a huge error, the fitted red curve was oscillating wildly; but now it is a little smoother, because we added the regularisation term. This log λ is how you measure the strength of the regularisation term: because typically λ is a very small number between zero and one, it can be something like 0.01, expressing it as log λ is more meaningful. So, when we use a small hyperparameter λ with L2 regularisation, then even with a polynomial of degree 9 we get a reasonably good fit. And when we use log λ = 0, that is very strong regularisation, and then it actually becomes a very high bias model, very similar to what we saw when we had just a polynomial of degree zero.

So by adjusting the strength of this λ, we can control the bias-variance trade-off; that is the idea behind regularisation. When λ is a very small number, we get a smooth curve fit, but when we make the regularisation very strong, with λ close to 1, we get a very high bias model.

(Refer Slide Time: 23:25)

Once we have the regularisation in place, in this case the L2 regularisation, we can look at a plot of the regularisation strength, log λ, versus the error that we measure, the root mean square error. And we see that beyond a point there is a very good region to operate in. So we are okay here: even while using a highly complex model, a polynomial of degree 9, we are getting the training and test errors to be close to each other by choosing an appropriate value for λ.

So, what do I mean by penalising the high weights? If we do not have regularisation in place and we fit using the 9th degree polynomial, you see that some of the weights that we estimate, the parameters of the model, are very high; and when you examine this, you see that it is meaningless, it should not be this way. Once we start adding the regularisation: this is without regularisation, which is log λ = −∞; this is some medium level of regularisation, where we got very good results with a very nice fit; and this is very strong regularisation, where you see that most of the weights go to very small numbers. Since we have only included 2 or 3 significant figures after the decimal place, they do not show up. So, by adjusting λ, we can actually manage the bias-variance trade-off. This was for the L2 regularisation, wherein we add the L2 norm of the parameters to the fitting error cost function.

(Refer Slide Time: 25:05)

We can also use the L1 norm, which is nothing but the sum of the absolute values of your parameters. In fact, if you go on to deep learning, many of the networks that you train will have a combination of the L1 and L2 norms, okay. So, typically, with a very complex model you would end up with a very high variance model, in the sense that the generalisation error is very high: on data that is not part of the training data, the model will not predict properly.

So in order to control that high variance problem, you add regularisation terms, which penalise very high values of your parameter estimates and smoothen your model to have reasonable variance and reasonable bias, thank you.

Machine Learning for Engineering and Science Applications
Professor Dr Ganapathy Krishnamurthi
Department of Engineering Design
Indian Institute of Technology, Madras
Gradient Descent Algorithms

(Refer Slide Time: 0:18)

Hello and welcome back. In this video we will look at some alternatives to gradient descent algorithms. Just to refresh your memory, we have looked at gradient descent techniques for optimisation. There are basically 3 different versions. One is called batch gradient descent, where the parameter update is made based on the entire training dataset: you calculate an average gradient based on the individual gradients that you compute for your training data points. The other extreme is stochastic gradient descent, where you update the parameters for every individual training data point. Here online learning is possible, because you can update your parameters as soon as a new data point arrives, but this method causes large oscillations in your objective function as well as in your parameter updates. To get the best of both worlds, what is typically done is mini-batch gradient descent, which is a combination of the above: you take subsets of your training data, then calculate the average gradient and use that to update the parameters.
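The three variants differ only in how much data feeds each update. Here is a minimal sketch of the mini-batch version on a linear least-squares problem (the data, batch size, and learning rate are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(200)

w, alpha, batch = np.zeros(3), 0.1, 32
for epoch in range(100):
    idx = rng.permutation(len(X))          # reshuffle each epoch
    for start in range(0, len(X), batch):
        b = idx[start:start + batch]       # one mini-batch
        # average gradient of the squared error over the batch
        grad = 2 * X[b].T @ (X[b] @ w - y[b]) / len(b)
        w = w - alpha * grad               # w := w - alpha * grad_w J
print(w)  # close to [1, -2, 0.5]
# batch = len(X) recovers batch gradient descent; batch = 1 is stochastic
```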

For reference, we have given the gradient descent update equation here, w := w − α ∇_w J: the current parameter estimate is the previous parameter estimate minus the update, where the update is the gradient of J with respect to the parameters, multiplied by α, which is called the learning rate. What we will do now is go through some of the variants of this gradient descent algorithm. These variants have evolved primarily from asking how we can make networks, which means deep learning networks, converge faster to the optimal solution.

Almost all of these algorithms can be used like black boxes in most of the packages that were introduced last time. In that sense you do not really have to code them; you just have to understand how they work and try out different options for your particular implementation of a deep learning or machine learning technique. For instance, tensorflow or pytorch have many of these algorithms already coded, and you just have to select that option.

(Refer Slide Time: 2:53)

So these are some of the methods that we will see: the momentum update, the Nesterov accelerated gradient, Adagrad, AdaDelta, and RMSprop, the last two of which are very similar.

We will start off with the momentum update. What this variant does is add a fraction of the previous update to the current update, which means that it takes a larger step in the relevant direction, thereby preventing oscillations and converging faster. So we have δw_n = γ δw_{n−1} + α ∇_w J(w_{n−1}) and w_n = w_{n−1} − δw_n. Here δw_{n−1} is the update from the previous step; you take a fraction of this update, with γ typically around 0.9, and add it to your current gradient term α ∇_w J(w_{n−1}). That way you take a bigger step in the relevant direction, and this is how you then update your parameters. So that is the momentum update equation.

The Nesterov accelerated gradient goes a little further. What it does is: you take a fraction of the previous update, add it to the previous parameter value, treat that as a look-ahead point, and evaluate the gradient at that look-ahead point. So the update equation is δw_n = γ δw_{n−1} + α ∇_w J(w_{n−1} − γ δw_{n−1}), where δw_n is the update to your parameters. Just like the momentum update, you keep a fraction of the previous update; but where the momentum version calculated the gradient of J at w_{n−1}, here you calculate it at the look-ahead point w_{n−1} − γ δw_{n−1}. Then you perform the usual update, w_n = w_{n−1} − δw_n. This way it helps you get better estimates of your parameter updates.
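The two rules differ only in where the gradient is evaluated. A minimal sketch on a toy quadratic objective (the objective, α, and the iteration count are illustrative; γ = 0.9 follows the lecture's suggestion):

```python
import numpy as np

def grad_J(w):
    # gradient of a toy quadratic objective J(w) = w^T A w / 2
    A = np.diag([1.0, 10.0])
    return A @ w

alpha, gamma = 0.01, 0.9
w_mom, dw = np.array([1.0, 1.0]), np.zeros(2)
w_nag, dv = np.array([1.0, 1.0]), np.zeros(2)

for _ in range(200):
    # momentum: dw_n = gamma*dw_{n-1} + alpha*grad J(w_{n-1})
    dw = gamma * dw + alpha * grad_J(w_mom)
    w_mom = w_mom - dw
    # Nesterov: same, but the gradient is taken at the look-ahead point
    dv = gamma * dv + alpha * grad_J(w_nag - gamma * dv)
    w_nag = w_nag - dv

print(w_mom, w_nag)  # both head towards the minimum at the origin
```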

(Refer Slide Time: 5:25)

The next version, Adagrad, is one where you have a different learning rate for different parameters. If you notice, in both the momentum and the Nesterov accelerated gradient methods you had the same learning rate α for all the parameters; you only used the previous update, or the look-ahead, to get a better estimate of the current update. Here, instead, we have a different learning rate for each parameter. The update equation is w_{n,i} = w_{n−1,i} − (α/√(G_{n,i})) g_{n,i}. So your current parameter is, of course, your previous parameter adjusted by this update term. Let me clarify the indices: i is the parameter index, so you might have hundreds of millions of parameters, and you consider each parameter separately; the current parameter comes from the previous iteration n−1 together with the update. The α is your general base learning rate, g_{n,i} = ∇_{w_i} J is the gradient at the current step for that particular parameter, and G_{n,i} = ∑_{k≤n} g_{k,i}² accumulates the squares of the gradients for that particular parameter. So you take the gradient with respect to that specific parameter index i, accumulate the squares of the gradients, take the square root, and normalise your base learning rate α by that term. What this does is handle the fact that some parameters receive large and frequent updates while others have very small or negligible ones: the normalisation makes sure the updates are scaled by this running accumulation. The disadvantage is that as the number of iterations increases, the denominator grows really large and your updates become very small. So g is just shorthand for your usual gradient with respect to the current value of the parameters, and G is the sum of the squares of the gradients from the previous iterations up to the current iteration. That is the Adagrad update.

(Refer Slide Time: 8:22)

AdaDelta and RMSprop keep the same basic idea, except that instead of taking the running sum of the squared gradients, you take a weighted, exponentially decaying sum, so that the denominator does not blow up. The formula is similar, δw_n = (α/√(E[g²]_n)) g_n, and the update is very much like what we saw earlier; but instead of G, which is just a running sum of squared gradients, you keep a weighted running sum E[g²]_n = ρ E[g²]_{n−1} + (1−ρ) g_n². Here g_n² is the squared gradient at the current step, weighted by the factor (1−ρ), with ρ between 0 and 1, plus the weighted accumulation of the squared gradients up to the previous step; you can start with E[g²]₀ = 0 for the first step. So you still accumulate the gradients, but you re-weight the accumulation by ρ every time. This way you avoid the problem we had with Adagrad, wherein the running sum of squared gradients kept accumulating, became a very large number, and the parameter updates became very small. Here you have a weighted sum, with one weight for the current squared gradient and another for the accumulation from the previous step, which gives you an exponentially decaying weighted average.
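A minimal sketch contrasting the two accumulators on a toy objective; the ε term (a standard small constant to avoid division by zero) and all the numbers are our own illustrative choices:

```python
import numpy as np

def grad_J(w):
    return 2 * w                 # toy objective J(w) = ||w||^2

alpha, rho, eps = 0.1, 0.9, 1e-8
w_ada, G = np.array([1.0, 1.0]), np.zeros(2)    # Adagrad running sum
w_rms, Eg2 = np.array([1.0, 1.0]), np.zeros(2)  # RMSprop decaying average

for _ in range(100):
    g = grad_J(w_ada)
    G += g ** 2                                  # keeps growing forever
    w_ada = w_ada - alpha * g / np.sqrt(G + eps)

    g = grad_J(w_rms)
    Eg2 = rho * Eg2 + (1 - rho) * g ** 2         # leaky accumulator
    w_rms = w_rms - alpha * g / np.sqrt(Eg2 + eps)

# Adagrad's effective step keeps decaying; RMSprop's does not
print(w_ada, w_rms)
```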

These 2 methods, AdaDelta and RMSprop, are very similar, except that AdaDelta also has another factor in the numerator, so the update looks like δw_n = (√(E[(δw)²]_{n−1}) / √(E[g²]_n)) g_n: in the numerator you have the weighted average of the previous updates. This is done to make sure that the units match. If you look at the plain update equation, say the w's carry particular units, some dimensions; then the gradient in the numerator and the square root of the squared gradients in the denominator cancel out, and α is just a constant, so the update does not carry the units of w. In order for the units to match, you include this weighted average of the updates in the numerator, and it helps match them; that is one motivation. RMSprop does not have this term, so that is the difference.

There are many other similar techniques, for instance one called Adam, which I have not described here; there are lots of references online, and you can look them up. All of these are quite popular choices for optimising deep neural networks, including general CNNs and the like. Many of them are available as black box implementations in software platforms like tensorflow, pytorch, or even MATLAB, so you are welcome to try them out when you begin coding your own deep neural networks. This concludes our session on gradient descent variants, okay, thank you.

Machine Learning for Engineering and Science Applications
Professor Dr Balaji Srinivasan
Indian Institute of Technology, Madras
Department of Mechanical Engineering
Introduction to Week 5 (Deep Learning)

Welcome to week 5. This week we will be introducing you to deep learning. Last week we saw linear regression, which was a simple model that connected input to output via a linear model. This week we will be looking at more models: one is something called logistic regression, which is a classification model; the next is the neural network; and subsequently we will go into what are called deep neural networks, which is usually the terminology used for deep learning.

(Refer Slide Time: 0:50)

So recall that last week, what we looked at was: given some input vector x, you want to connect it to some output vector y via a linear model, because for some reason you think that the connection between input and output, the regression connection, is actually a linear one, ok. In that case, our h(x) with parameters w is assumed to be a linear model: you take a hypothesis function h(x), you say that my guess is ŷ, you will already have some ground truth y, and using these two you calculate the cost function J and feed it back so as to improve w, by looking at ∂J/∂w. So this is what we saw last week, and we saw that the same model could be used for linear, quadratic, cubic and any type of polynomial fit.

(Refer Slide Time: 3:38)

This week we will make just a very simple change to this; actually you will be very surprised at how simple this change is, in case you have not seen it before, and you will be able to achieve almost universal computation. The first example of this that we will see is something called logistic regression, which is a classification algorithm; this is what we see first this week. Let me point out how we will do that, and you will see the details in the next few videos. All we do is a very simple thing: remember, in linear regression you had x, I showed you a notation, you multiply by w, run it through a summation, and you get ŷ; this was linear regression.

What we will see this week is that logistic regression is a very small change over this: you take x, again with the same parameters w, run it through a summation, and we add one small change, a nonlinear function. This is called a non-linear activation function, and this gives our ŷ; this is called logistic regression for certain choices of activation functions. So please remember this name, activation function: all an activation function is, is a nonlinearity added after your linear combination, ok. We will typically denote the non-linear activation function by g, ok, so g stands for some non-linear function.

(Refer Slide Time: 4:11)

All this put together lets us achieve classification in a very simple way; you will see this in the coming videos. After this we will look at neural networks; we will also look at what are called DNNs or Deep Neural Networks, hence the name deep learning. Once again let me give you a schematic of what this is, and it is actually very straightforward as well. You take your x, run it through a linear combination with some weights w, then run it through a non-linear function g, ok. Then run it through another linear combination with some other weights, let us call them w₁, then some other weights w₂, and another non-linearity g, and so on and so forth, Σ g, and finally you get your output prediction ŷ.
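To make the schematic concrete, here is a minimal NumPy forward pass through two such Σ-g layers; the layer sizes, the random weights, and the choice of sigmoid for g are illustrative assumptions:

```python
import numpy as np

def g(z):
    # one common choice of non-linear activation: the sigmoid
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.standard_normal(3)              # input vector

# layer 1: linear combination (weights W1, bias b1) followed by g
W1, b1 = rng.standard_normal((4, 3)), np.zeros(4)
a1 = g(W1 @ x + b1)

# layer 2: another linear combination and non-linearity -> prediction
W2, b2 = rng.standard_normal((1, 4)), np.zeros(1)
y_hat = g(W2 @ a1 + b2)
print(y_hat)
```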

Now all this put together is called a Deep Neural Network; that is it, it is a very simple idea. Each of these combinations of Σ and g is called a layer: this is layer 1, layer 2, layer 3, and so on. A deep neural network is one which has more than one layer; that is all, if you have more than 1 layer it is called a deep network. So this, in a nutshell, is all there is to deep learning, at least in terms of simple implementation. Let me come to a few more details that we will see as we move through this week. You will see as we go ahead that we need to pay attention, in any model of this sort, to the following factors. The 1st one is: how do we characterize the output y or ŷ?

(Refer Slide Time: 7:55)

As we saw last week, sometimes the output itself is a number: the number, for example, last time was our ⍺, the coefficient of thermal expansion, or it could be a house price or something of that sort that you are trying to predict, a number that you are trying to predict as a regression problem. If the output is a number, it is easy to say what y is. Sometimes, however, you are looking at a slightly more qualitative thing, for example success or failure of a machine part, or it could be a classification, for example, this is a cat or a dog or a horse, okay, so you have something of that sort.

Remember what we saw in the 1st week: every single thing that a machine learning algorithm does maps one set of numbers to another set of numbers. So when I say characterize, it means how do we assign numbers; how do we assign a number to something like a cat or a dog? We will come up with a very simple idea. For a regression problem it is usually obvious; for a classification problem, in a few cases it can be slightly subtle, so we will look at that. So that is the 1st thing. Next is, of course: what is the feed forward model, okay. I have already spoiled that for you; I have told you that logistic regression is simply a linear combination followed by a non-linearity.

Which nonlinearity we use usually also plays a part; for a neural network it is linearity-nonlinearity, linearity-nonlinearity, and so on and so forth, that is usually what happens in a neural network, ok. An auxiliary problem is: which non-linear function, ok, so how do we choose g? We will look at some rules of thumb; again there is no hard and fast rule, but we will give you some common choices that are available within the literature this week. The 3rd thing is: what is the loss function?

Just to recapitulate: you have ŷ, so how do I give numbers to ŷ? The 2nd thing is: how do I go from x to ŷ, given that I have decided some numbers for x and some numbers for y? What the forward model decides is this functional form; remember, we distinguish between the functional form and the functional parameters, so how we take this functional form from input to output is what we have to choose next. The 3rd thing you have to choose is: given that you have ŷ and you have some ideal output y, how do I get J, ok; so that is the 3rd thing that you have to decide.

(Refer Slide Time: 11:47)

Fourth, what you have to find out is: how do we calculate ∂J/∂w? In other words, this is the gradient problem, ok. There is a 5th problem, which we will not be discussing very much, which is: how do we use ∂J/∂w to find a better w? As of now, more or less what I am assuming is that we will simply use some form of gradient descent, and as Dr. Ganapathy told you last week, you can actually use several variants of this; typically, pure gradient descent is almost never used in practice.

We use some variant or the other, which is a slightly modified version of gradient descent. But for our understanding of the algorithm, what we will split our problem into is finding these 4 things: a number representation, okay, a characterization or representation of ŷ, so the representation problem; the forward model; the loss function; and the gradient. If you know these 4 things, you have a deep learning model; one way or the other you can always get a deep learning model, and it is just an optimization problem after that, okay. So I will just request you to pay attention to these 4 as we go over logistic regression as well as the deep neural network. You will see these 4 once again in different forms when we move on to the next few weeks, which will cover what are called convolutional neural networks and recurrent neural networks; but if you get these 4 during this week, you will have a very good picture of what is actually needed to set up a deep learning model. Thank you.

Machine Learning for Engineering and Science Applications
Professor Dr Balaji Srinivasan
Indian Institute of Technology, Madras
Department of Mechanical Engineering
Logistic Regression

(Refer Slide Time: 0:27)

In this video we will be looking at our 1st classification algorithm, called logistic regression. Note that even though the name regression is sitting there, it is actually a classification algorithm. In the previous video we saw linear regression, which tries to fit an optimal, usually least-squares, fit to some given data using weights which enter linearly. Now in this video we will look at the idea of trying to classify two different sets of data, that is, binary classification. By binary classification, I mean there are two possible classes; we label these classes simply 1 and 0, these are the labels that we are giving our classes.

So suppose you collect a bunch of examples, and when you plot them, suppose x₁ and x₂ are two features of your data and you are classifying your data according to these, okay. What you see when you plot the figure is that the points are nicely clustered to one side or the other, and intuitively we can draw a line separating the two, a separating or classifying line. Now can we use our linear regression idea in order to achieve some sort of classification of this sort, okay? So here is the problem: you are given x₁ and x₂, which form your input x vector, and you have a bunch of examples, each data point being an example; as we did with linear regression, we will use the same notation, and so on and so forth.

Once again we will use m examples; this is our input, and our output, as before, is y. Now unlike linear regression, remember that our output has to be one of 2 classes, in this case either the red class or the green class, which is class 1 or class 0. So our label y should always be either 0 or 1, and our job is to get a predictive model which does the same: ideally it should classify a point as 0 whenever the truth is 0 and as 1 whenever it is 1, and of course we would like it to work for new points as well, okay, for some new point which is now introduced. So how can we do this?

(Refer Slide Time: 4:11)

Suppose we try simple linear regression; there would be a problem. What is that problem? Suppose I introduce a new point somewhere here, ok, and I had fit a simple linear regression line; whatever model I use, if I use a linear model ŷ = w₀ + w₁x₁ + w₂x₂, then regardless of your weights w₁ and w₂, if I pick a far enough point, this value will come out very high, it is not going to lie between 0 and 1. Similarly, if I choose a point at the other extreme, there is no way for you to ensure that a simple linear model will always give values between 0 and 1. This is why we use a very simple idea, which is to squish all data to the range 0 to 1, which is what we require of our output, okay. How do we do this?

Once again there is a very simple idea for this: we use something called the sigmoid (σ) function. The sigmoid function is defined as follows: σ(z) = 1/(1 + e^(−z)). What does it look like? You
can see its limits: as z tends to ∞, σ(z) tends to 1, because e^(−z) tends to 0; as z tends to −∞, σ(z) tends to 0, because e^(−z) tends to ∞. Exactly at z = 0, σ(z) is equal to 0.5. Notice also that this is a monotonic function. So if you plot the sigmoid curve, it looks somewhat like this: at z = 0, σ is 0.5; as z tends to ∞, σ tends to 1; and as z tends to −∞, it tends to 0.

(Refer Slide Time: 7:58)

What this tells us is that if I predict ŷ as the sigmoid of what I had for linear regression, instead of the raw linear regression, then ŷ will always lie between 0 and 1. This has an additional advantage, which is that we can now interpret ŷ as the probability that the output belongs to class 1 given the input x; let me explain. Let us look at this line here. We want this value, let us call it z, z = w₀ + w₁x₁ + w₂x₂, to be such that if the point has to lie in class 1, then z has to be really high, ok; if it has to belong to class 0, we know that z has to be really low, ok.

Now suppose I look at a point which is somewhere here, let us say this is a new point, ok. What this tells me is that this point is almost certain to lie in class 1. Just looking at this data, the further away we get from this classifying line on this side, the more and more certain we are that the point belongs to class 1. Similarly, the further away we get on the other side of the line, the more certain we are that it belongs to class 0, okay. So we can think of z as the perpendicular distance from the classifying line.

(Refer Slide Time: 11:40)

How does that help us? The classifying line, then, is the line z = 0, which means σ(z) equal to 0.5. If I come to this side, σ(z) becomes close to 1; if I come to the other side, σ(z) becomes close to 0; and the closer I am to this line, the more uncertain we are about where the point lies, whether in class 1 or in class 0. Therefore it is easy to interpret ŷ, which was equal to σ(z), as the probability that you belong to class 1, okay. For example, if your sigmoid value is close to 0.5, okay, then what it means is that you are not really certain whether it is class 1; the probability is approximately 0.5 that it is class 1.

If instead the sigmoid is close to 0.99, okay, then we know that the point is really far away on the class 1 side, and we are very, very certain that it lies in class 1. Suppose the sigmoid value is 0.01; then we know that the probability that it lies in class 1 is actually pretty low, it is equal to 0.01. So this is the simple idea behind logistic regression; we will see how to compute classifying lines using logistic regression in later videos.

Machine Learning for Engineering and Science Applications
Professor Dr Balaji Srinivasan
Indian Institute of Technology, Madras
Department of Mechanical Engineering
Binary Entropy Cost Function

(Refer Slide Time: 0:28)

In the last video we saw that we had a simple algorithm, or a forward model, for logistic regression. Recall that the forward model was ŷ = σ(w·x), where w includes w₀, w₁, ..., wₙ if you have n features, and x here is x₁, x₂, ..., xₙ. So this was our forward model. Now the question is: what is a good cost function for this? Remember, in our usual learning paradigm, what we have is: I have an x, this predicts a ŷ, the ground truth is some y, and I wish to find some cost or penalty for ŷ and y being different, ok.

So now, why not use the least square cost function? The least square cost function was simply (1/2)(ŷ − y)²; of course, I am taking this for one particular instance, one particular example; the usual thing was to take the sum over all the examples and average, which is what we did for linear regression. So why not use this for classification? It turns out that this is not a good model, okay.

(Refer Slide Time: 2:18)

It is not a good cost function, or at least it is not an optimal cost function, for several reasons; I will just mention one, okay. Recall that in a binary classification problem, for a given x, the ground truth y is either 0 or 1, while ŷ lies between 0 and 1. Now suppose you are working on a case where you are trying to distinguish between, let us say, something as serious as cancer and no cancer, or even a cat versus a dog. Notice what happens when you totally misclassify: for example, y is 0 and let us say ŷ is 1 or close to 1, say 0.99 (we saw in the previous video that ŷ gives an estimate of the probability that the class is actually 1); for example, a case where the ground truth is that the person does not have cancer and you say it is cancer. The cost that you incur for this misclassification under least squares is actually very low; we do not penalize this mistake highly enough. Even though there is a penalty, it is not high enough. Because of that, this is one of the reasons why the usual least square cost function is a bad cost function for classification.

(Refer Slide Time: 4:22)

We instead use something called the binary cross entropy cost function. The form of that cost function is different: J = −( y ln ŷ + (1 − y) ln(1 − ŷ) ). We will come to the reasons for each of these terms shortly, including why there is a minus sign and why both terms are sitting there. Now let us think about some properties that we want the cost function to have, and let us check whether this one has them or not. Some desirable properties for a classification cost function: first, of course, J should be 0 if y is equal to ŷ; this is the 1st thing that we have to check. Second, J should be very high for misclassification. And the 3rd, which is merely required for consistency, is that J should be greater than or equal to 0.
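A minimal NumPy sketch of this cost function; the clipping constant eps is our own addition, a standard guard against taking the log of exactly 0:

```python
import numpy as np

def bce(y, y_hat, eps=1e-12):
    # J = -(y ln(y_hat) + (1 - y) ln(1 - y_hat)); eps avoids log(0)
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

print(bce(1, 0.99))   # correct, confident prediction: cost ~0.01
print(bce(0, 0.99))   # confident misclassification: cost ~4.6 and growing
print(bce(0, 1e-6))   # correct prediction: cost ~1e-6
```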

(Refer Slide Time: 6:16)

Remember our least square cost function: the least square is obviously always positive. So let us take this step by step, starting here. Notice that ŷ will always lie between 0 and 1; it is a probability, so instead of saying that this person has cancer or not, you will say something like 'the probability that this person has cancer is 0.9'; that is going to be the outcome of your logistic regression. Why is that? Because our ŷ is the sigmoid of something, and the sigmoid always lies between 0 and 1; because of that, ŷ is constrained to be between 0 and 1.

Notice that y is either 0 or 1; it is not between 0 and 1, it is exactly 0 or 1. Why is that? It is the ground truth: this is a supervised learning task, so you already know whether this person has cancer or not from their history; or, if you are trying to classify images (let us say cat and dog), you already have a labelled set, and it is based on that labelled set that you are training, so you already have all these labels available.

So y is either 0 or 1, and ŷ is between 0 and 1. Therefore ln(ŷ) is going to be negative; y is either 0 or 1, so the whole term y ln ŷ is negative or zero. Similarly, the term (1 − y) ln(1 − ŷ) is also negative or zero, and that is why we have the minus sign, so that the whole expression becomes positive, consistent with least squares; that is the first property. Now let us see: is J equal to 0 if y equals ŷ? Okay, so let us take a few cases.

(Refer Slide Time: 8:04)

Take y = 0 and ŷ equal to 0, or rather close to 0 (I will keep it close to 0 just to avoid the singularity at exactly 0; why am I keeping it close to 0? because the sigmoid, 1/(1 + e^(−z)), is actually never going to give you exactly 0), let us say ŷ = 10⁻⁶. Then y ln ŷ = 0 × ln(10⁻⁶) ≃ 0 and (1 − y) ln(1 − ŷ) = (1 − 0) × ln(1 − 10⁻⁶) ≃ 0, so J is approximately 0.

Similarly you can check that if y is 1 and ŷ is close to 1, then y ln ŷ ≃ 1 × ln(1) ≃ 0 and (1 − y) ln(1 − ŷ) = (1 − 1) × ln(1 − ŷ) = 0. So this condition is also satisfied: if you classify correctly, your cost function is going to be approximately 0. Third, which is the main property (the first two properties are true of least squares also, but this is the one that least squares does not satisfy): we want J to be as high as possible in case you have misclassified. So let us check that; I will check it just for one case.

So let us say y is 0 but ŷ is approximately 1, let us say 0.99 or something of that sort. Just to give you an example, suppose the person does not have cancer, or the image is, let us say, a dog, and you end up saying this is actually not a dog and I am very, very certain about it, okay, I am certain up to 99.99% that this is actually a cat. So you are misclassifying with high confidence; then what happens to the cost function? Let us take a look.

(Refer Slide Time: 10:27)

In this case, y ln ŷ = 0 × ln(1) ≃ 0, and (1 − y) ln(1 − ŷ) = (1 − 0) × ln(1 − 1) → −∞. With the minus sign in front, J is positive and enormous: you are going to throw up a really high cost because you have misclassified. So that is the trick that we are using: J tends to infinity as ŷ tends to 1 when y is 0. You can also check, as an exercise, that if y is 1, so the ground truth is 1, and ŷ is approximately 0, J will again tend to infinity.

So the basic trick here is that in case you have a misclassification, the cost function throws up a very high cost, and in case you have a correct classification, you get J equal to 0, or approximately 0, depending on how close you are to the correct classification, okay. So this is what is called the binary cross entropy cost function. Along with the least square cost function, these are the two main loss functions that we will be using more or less throughout the course. We will make a small modification to the binary cross entropy cost function shortly, when we take up multiclass classification, but that is a very minor adjustment to what we have here.

So, just two cost functions: for regression-type problems we typically use something like a least square cost function, and for classification problems we primarily use something like a binary cross entropy cost function, thank you.

Machine Learning for Engineering and Science Applications
Professor Dr Balaji Srinivasan
Indian Institute of Technology, Madras
Department of Mechanical Engineering
OR Gate Via Classification

(Refer Slide Time: 1:11)

In this video we will be looking at a simple example of logistic regression, by trying to represent the OR gate as a classification problem. Let me show you what I mean here. We all know what an OR gate is: it takes two inputs, let us call them x₁ and x₂, which are always 0 or 1. Let us call this the x vector. So 0 OR 0 gives you 0, and 0 OR 1 gives you 1; even if only one of the inputs is 1, we get a 1. It is a simple logic gate. Suppose we represent this as a figure, with x₁ on this axis and x₂ on that axis: 00 gives a 0 and the others give us 1. So let * represent 0 and let the circle represent 1, ok.

(Refer Slide Time: 4:17)

What we want to do is find a logistic regression that will classify at least these 4 points correctly. You can see this as a binary classification problem, where O is one class and * is another class, ok. Intuitively we can see that if I draw a line somewhere in the middle here, one side of the line will be classified correctly as * and the other side will be classified correctly as O; but we will try to do this mathematically. Remember, we can represent our logistic regression as a simple neural diagram as follows: you have x₁, you have x₂, these combine in the summation to give z = w₁x₁ + w₂x₂ + w₀, followed by a sigmoid, and our prediction ŷ is σ(z).

Finally, we classify after this: our prediction is 1 if ŷ is greater than 0.5, and it is 0 if ŷ is less than 0.5 (we can arbitrarily decide that equality goes to 1). So the question we are asking is: what weights w₀, w₁, w₂, basically what w vector, will classify the same way as the OR gate? If we can do that, we have essentially represented the OR gate as a simple neural network, okay, or as a simple logistic regression network. So let us try to find these values.

(Refer Slide Time: 5:05)

Now, to get this, let us back-calculate and see if we can come up with something. If I want the classification to be 0, it means ŷ has to be less than 0.5, ok; otherwise, ŷ has to be greater than 0.5.

ŷ is σ(z), and we remember how the sigmoid function works: if z = w₀ + w₁x₁ + w₂x₂ and ŷ = σ(z), then at z = 0 your σ(z) is 0.5; whenever z is greater than 0, σ(z) is greater than 0.5, and whenever z is less than 0, σ(z) is less than 0.5. So let us now find out what z has to satisfy. If σ(z) has to be less than 0.5, then we know that z has to be negative; similarly, for the other three cases, z has to be positive. We also know that z = w₀ + w₁x₁ + w₂x₂. So consider the case where both x₁, x₂ = 0: we want z to be negative, which automatically means that w₀ has to be negative. Also notice that without w₀ we could not have made this possible at all.

(Refer Slide Time: 7:29)

If you just have w₁x₁ + w₂x₂, a simple linear combination of x₁ and x₂ without this bias term, you cannot make this case work out at all. So we now know that w₀ has to be negative. We also know that w₁ and w₂ have to be positive to make the other cases work. So what is one set of weights which will work? Let us take a simple example: let w₀ equal −1. Coming to the case where only x₂ is active, x₂ is 1, I now know that w₀ + w₂ has to be greater than 0; w₀ is already −1, so I can easily assign w₂ equal to, let us say, 2. Similarly you can argue that w₁ can be made 2, and so we have come to this set of weights.

(Refer Slide Time: 8:01)

Now notice something; we will come to the physical meaning of these weights soon, but note two points. 1st, of course, the point I made earlier is essential: you cannot make the bias term go away, which tells you the need for the bias unit. Second, notice that w₀, w₁, w₂ are not unique; this is very important. In general, the result of logistic regression need not be unique. In fact you can see that this is geometrically true: this classifying line here, whichever line we draw, has a lot of give; you can move it back and forth and still make this work, and you can incline it in several ways.

(Refer Slide Time: 10:04)

This lack of uniqueness is due to 2 things. You can multiply w₀, w₁, w₂ by a constant and the weights would still not be unique, but that is a trivial sort of non-uniqueness; the more substantial sort of non-uniqueness is the fact that this line can move back and forth, it can be translated as well as rotated a little bit, and still the classification will work. So these two are important points for us to notice. Now the line that we have got is z = w₀ + w₁x₁ + w₂x₂. Remember, the interpretation of the classification line is that it is the line z = 0, which is the line w₀ + w₁x₁ + w₂x₂ = 0; if you write down the values of the w vector, we get x₁ + x₂ = 1/2, because w₀ = −1 and w₁, w₂ = 2.
2

So this is exactly the line that bisects these two, okay; that is the arbitrary set of values that we have given here. You could also give, for example, the values w₀ = −1, w₁ = 3, w₂ = 3, in which case the classification line would still be parallel to this one but shifted slightly: that would be the line x₁ + x₂ = 1/3. So this tells you how the OR gate can be replicated using logistic regression. In the next video we will see how the other gates can also be replicated using logistic regression.

Machine Learning for Engineering and Science Applications
Professor Dr. Balaji Srinivasan
Department of Mechanical Engineering
Indian Institute of Technology, Madras
NOR, AND, NAND Gates

(Refer Slide Time: 0:25)

In this video, we look at a few more gates very, very quickly; the basic algebra for these cases has already been done in the previous video. So suppose you take the NOR gate. The NOR gate is essentially the opposite of the OR gate; I will quickly write the truth table: for inputs 00, 01, 10, 11, it is exactly the opposite of OR. How would we make weights for this? I will let you work through the algebra, but essentially you can probably see the answer if you remember that w₀ = −1, w₁ = w₂ = 2 was one possible set of values.

This was the OR gate. You can guess that flipping the weights, making w_0 equal to 1 and w_1, w_2
the negatives of the OR values (w_1 = w_2 = −2), will give us the NOR gate. You can check this quickly. If you evaluate w_0 + w_1 x_1 + w_2 x_2

for (x_1, x_2) = (0, 0), the value comes out positive; therefore ^y after classification will

come out to 1. Similarly, in the (x_1, x_2) = (0, 1) and (x_1, x_2) = (1, 0) cases, w_0 + w_1 x_1 + w_2 x_2 will be −1 and ^y will

be 0. In the (x_1, x_2) = (1, 1) case you will get −3 and ^y will be 0 again. So this works out.

You can similarly check how the AND gate works. I will write the truth table: the output is 1 only if
both of (x_1, x_2) are non-zero, that is, only for the input 11. For a set of weights that will give you this result, you now have to
weight the bias unit a little more heavily, so that it does not trigger a positive result even if only one of
the inputs is active. Let us take w_0 = −3, w_1 = w_2 = 2. Then for the (x_1, x_2) = (0, 1) case, w_0 + w_2 x_2 = −1 and it will still give
^y = 0. So this gives the AND gate, and you can easily guess that the NAND gate would simply
be the negative of these weights.
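As a quick check of these weight sets, here is a minimal sketch (Python, with my own helper names; an illustration, not the course's code) that runs all four gates through the same single-neuron model:

import math

def neuron(w, x1, x2):
    z = w[0] + w[1] * x1 + w[2] * x2
    return 1 if 1.0 / (1.0 + math.exp(-z)) > 0.5 else 0

# one possible weight set per gate, as derived above
gates = {"OR": (-1, 2, 2), "NOR": (1, -2, -2),
         "AND": (-3, 2, 2), "NAND": (3, -2, -2)}
for name, w in gates.items():
    outputs = [neuron(w, x1, x2) for x1 in (0, 1) for x2 in (0, 1)]
    print(name, outputs)  # OR [0,1,1,1], NOR [1,0,0,0], AND [0,0,0,1], NAND [1,1,1,0]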

(Refer Slide Time: 3:33)

So we saw in this video that the elementary gates OR, NOR, AND and
NAND can all be represented by the simple architecture: x_1, x_2 and a bias unit which is
simply 1, followed by our weights w_0, w_1, w_2, followed by the sigmoid, followed by a classification. This
portion, put together as we saw earlier, can simply be called an artificial neuron. So the
question with which I will stop this video is: can all logic circuits, or in fact
can all logic gates, be represented by this simple network?

We will see in the next video that this is not possible, and since it is not possible, that is what
leads us to extra layers within the neural network when we have to consider things like the XOR
gate.

Machine Learning for Engineering and Science Applications
Professor Dr. Balaji Srinivasan
Department of Mechanical Engineering
Indian Institute of Technology, Madras
XOR Gate

In this video, we will be looking at the XOR gate. You might recall from the previous videos
that all elementary gates such as the OR gate, AND gate, etcetera could be represented as
simple one- or two-layer neural networks, basically based on simple logistic regression.
One of the first historical counterexamples to what all neural networks, or what all elementary perceptrons,
could do, given in the book Perceptrons by Minsky and Papert, was the XOR gate.

(Refer Slide Time: 0:58)

So what is XOR? We will write XOR once again as a truth table; XOR gives 1 or 0 as an
output. If x_1 and x_2 are different, it gives 1 as output;
if both are the same, it gives 0 as output. So once again let us try to represent this graphically in
order to see what the problem is for logistic regression. For 00 and 11 you get 0, and for
01 and 10 you get 1. So we have two classes: the red or x class, and the o class.

Now, we know that the way logistic regression works, the z = 0 line, which is
w_0 + w_1 x_1 + w_2 x_2 = 0, functions as the classifying line. But unfortunately you can see that no
classification line would work here: any line you draw cannot actually separate the x's
from the o's. This case is not linearly separable; please note the word "linearly".
We can of course make classification boundaries that are non-linear, for example this one; this
would be a non-linear classification boundary. But linearly these two classes cannot be
separated. So this is fundamentally the problem: since the data is not linearly separable,
you cannot use logistic regression to separate these two classes. However, we will see
that one simple trick makes it work. Let me now talk about how this problem can be solved.

So there are two possible solutions to the XOR problem. The first is to use non-linear features.
You have to be a little careful with how you use non-linear features; we will see later in
the course that this kind of idea leads to something called the kernel trick. Very simply, we will have
to start using other features rather than just x_1 and x_2, in order to create a new
classification boundary that is a non-linear function of x_1 and x_2 and not just a linear
function.

(Refer Slide Time: 4:55)

There is another possibility, which is to add extra layers. I will talk shortly about what this means.
This is what is called a deep network. Technically speaking, only if you add more than two layers
do you get a deep network, but I will abuse the terminology a little and call this a
deep network.

(Refer Slide Time: 5:24)

Let me explain how this comes about. We know that XOR of two inputs x_1 and x_2 can be
written as NOR of NOR and AND; that is, XOR can be written as a combination of elementary
gates that we have already looked at. Recall that we already have weights for the NOR gate, the
AND gate, et cetera, which means we can in some sense write XOR in terms of those
elementary gates we have already discovered. How would this look? Suppose I have x_1 and
x_2; as usual I add a bias unit which is simply 1, and which we can call x_0 if we wish.

So this network, or this expression, first takes the NOR of x_1 and x_2. So let us say I have
a NOR here; this is a combination of these three inputs. Let us name these weights:
remember, for the NOR gate w_0 = 1, w_1 = −2, w_2 = −2. I am going to erase this here for a
particular reason, otherwise the figure would get very cluttered. I also have an AND gate,
let us call this AND here. This is another linear combination of x_1 and x_2, so we need some
notation, which we will introduce shortly. But remember that these new weights are
actually different: for the AND gate we had used w_0 = −3, w_1 = 2, w_2 = 2, which
are different from the weights we used for the NOR gate. Let me erase these two; we will use
some notation shortly which will make this whole thing clear. Now, the output of the
NOR and the AND goes to another NOR. So you have yet another NOR gate, which needs yet
another bias unit. These weights are once again the same weights as the first NOR gate.

So you can now see you have six weights here and three more weights here; we have a total of
nine weights. The output of this, as we know, is XOR. I will show this to you quickly in a truth
table so that you can convince yourself. For (x_1, x_2) = [(0, 0), (1, 0), (0, 1), (1, 1)], I will write both
NOR and AND as outputs. NOR of [(0, 0), (1, 0), (0, 1), (1, 1)] is [1, 0, 0, 0], and
AND of [(0, 0), (1, 0), (0, 1), (1, 1)] is [0, 0, 0, 1]. Now if I apply NOR again to these paired outputs,
NOR of [(1, 0), (0, 0), (0, 0), (0, 1)] is [0, 1, 1, 0], which is exactly the XOR output.

To be clear, let us call these y_1 = [1, 0, 0, 0] and y_2 = [0, 0, 0, 1]. NOR of y_1, y_2 is
[0, 1, 1, 0]. So this is the XOR gate, which you get as this combination, and we have now drawn
what I will call the neural diagram. The difference between this diagram and the previous diagrams is that
it has an intermediate layer, which is the new invention in some sense. This intermediate
layer is called the hidden layer, because it is neither the input nor the output layer, which is
all we had been working with before. This layer here is often called the input layer, and this layer
here is called the output layer.

Let us use some simple terminology. Because we now have nine weights, we need to name
them; we could simply call them w_0, w_1, w_2, etcetera, but that runs out quickly. So let us call this unit 1 and this unit 2 (I
had called their outputs y_1 and y_2); we will make this a little more precise later on. The weight going
from the zeroth (bias) unit of the input layer to the first unit of the hidden layer, I will call
w_01. Remember there is nothing that goes
from here to here; there is only a connection from here to the first unit.

So the NOR weights were w_01 = 1, w_11 = −2, w_21 = −2. Similarly we have the weights
w_02 = −3, w_12 = 2, w_22 = 2, which were the weights of the AND gate. Now we also have the
weights of the final NOR to represent, and we need some notation
for that; you notice that we have run out of 0s, 1s and 2s. So we have to give some slightly
better notation. What we will do is call this layer, the one that
connects the input layer to the hidden layer, layer 1. These are the layer 1 weights, and there are six
of them in total. We then label our final layer weights as w_01^(2) = 1, w_11^(2) = −2, w_21^(2) = −2;
these are the weights of the second layer, indicated by the superscript. You can see
therefore that the XOR gate can be written as a neural network (we will call this kind of diagram a
neural network; each of the units sitting here is a neuron), in this case a neural network
with one hidden layer.
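To make this construction concrete, here is a minimal sketch (Python; the helper names are my own, an illustration rather than course code) of the two-layer network just described, with the layer 1 NOR and AND units feeding a layer 2 NOR:

import math

def neuron(w, inputs):
    # w[0] is the bias weight; the rest multiply the inputs
    z = w[0] + sum(wi * xi for wi, xi in zip(w[1:], inputs))
    return 1 if 1.0 / (1.0 + math.exp(-z)) > 0.5 else 0

W1_nor = (1, -2, -2)    # hidden unit 1: NOR weights
W1_and = (-3, 2, 2)     # hidden unit 2: AND weights
W2_nor = (1, -2, -2)    # output unit: NOR again

for x in [(0, 0), (1, 0), (0, 1), (1, 1)]:
    h = (neuron(W1_nor, x), neuron(W1_and, x))  # hidden layer outputs
    print(x, "->", neuron(W2_nor, h))           # prints 0, 1, 1, 0: XOR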

(Refer Slide Time: 14:46)

It turns out there is a theorem called the universal approximation theorem, which I am going
to state very loosely here: any function can be approximated by a neural network with one
hidden layer. So as it turns out, this one hidden layer has tremendous expressive power. If you
pre-specify your accuracy, if you have a functional relationship that you would like to express
and you want it approximated to an accuracy of, say, 10^−6, it turns
out that there exists, in precise mathematical language, one neural network with a single hidden layer which will do that.
What is not given is how many neurons you will need in this
hidden layer. As it turned out for XOR, we only needed two hidden
neurons. It might happen that for 10^−6 accuracy you would need billions of neurons for some
particular function. But what we do know is that it is always possible, and this is what makes
deep networks very powerful tools. Now, what would be a good approximation, what
number of neurons you would need, et cetera, is always questionable. I would finally
like to end with one simple interpretation of what is happening here.
Let us go back to our original diagram.
Let us go back to our original diagram.

(Refer Slide Time: 16:40)

We had these four points. When they were on a single plane, simply the x-y
plane, it was not possible to separate them using a single line. As it turns out, one can interpret
the addition of extra hidden layers (though not exactly in the particular case that we have taken)
as creating new dimensions in the problem. So I would like you to
imagine a specific case. Remember we had these four points, (x_1, x_2) = [(0, 0), (1, 0), (0, 1), (1, 1)].

Now just imagine that you introduced a new dimension into the problem, and the two
x's were sitting somewhere above:

the old x's that we had in the problem were actually projections of points that lie on a
plane a little bit above. Now you can imagine that if I have a three-dimensional problem, I could
introduce a plane somewhere in the middle that would separate the x's from the o's. So
this will require a little imagination on your part: imagine these four points are there and
you lift the two x's up. Now, what was not linearly separable in a lower-dimensional space can
actually be linearly separable in a higher dimension.

I am not going to prove this, or even show it for the XOR gate, but you can see it at least
geometrically. As it turns out, it is useful to think of every additional layer as
introducing new dimensions into the problem, making problems that looked linearly
non-separable before, separable in a higher dimension. In future videos we will
see how to extend this idea of a single hidden layer into a larger network. The XOR gate and the
elementary gates are good cases on which to build your intuition about how neural networks
work, so I would recommend that you go over these examples a few times in order to understand
what is actually happening and why a hidden layer is called hidden.

Machine Learning for Engineering and Science Applications
Professor Dr. Balaji Srinivasan
Department of Mechanical Engineering
Indian Institute of Technology, Madras
Differentiating the sigmoid

(Refer Slide Time: 0:25)

In this video, which is a very short one, we just look at the derivative of the sigmoid
function. This is used very often, so we will do it very quickly. Remember
that σ(z) = 1/(1 + e^(−z)). So dσ/dz, which we call σ'(z), is
−1/(1 + e^(−z))^2 multiplied by the derivative of (1 + e^(−z)), which is −e^(−z). This gives

σ'(z) = e^(−z)/(1 + e^(−z))^2 = [1/(1 + e^(−z))] × [e^(−z)/(1 + e^(−z))].

The first factor is simply σ(z), and the second factor can be written as 1 − σ(z). So this
gives us the relation σ'(z) = σ(z)(1 − σ(z)). We use this
a couple of times while doing back propagation.
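As a quick numerical sanity check (a small Python sketch of my own, not part of the lecture), we can compare σ(z)(1 − σ(z)) with a finite-difference estimate of the derivative:

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

z, h = 0.7, 1e-6
analytic = sigmoid(z) * (1.0 - sigmoid(z))
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)  # central difference
print(analytic, numeric)  # the two values agree to many decimal places

Thank you.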

Machine Learning for Engineering and Science Applications
Professor Dr. Balaji Srinivasan
Department of Mechanical Engineering
Indian Institute of Technology, Madras
Gradient of Logistic Regression

(Refer Slide Time: 0:25)

Welcome back. In the previous videos we saw our forward model of logistic regression. We also
saw how we could use it to simulate the OR gate, NOR gate, etcetera. In all
these cases, what we were doing was guessing the weights: without actually doing
gradient descent, by heuristic rule-of-thumb guessing, we figured out some
weights for which we could replicate the OR gate. Obviously that worked well because
we had only four data points and we just had to fit a classifying line between those four data
points. In general, of course, it is not really possible to just guess a good line, which is why
we now look at how to calculate the gradient for logistic regression, which will complete our
loop.

So remember, for logistic regression we are dealing with input nodes as usual, x_1, x_2, x_3, and we
can have our usual bias unit sitting there. We have one output. Why only one output? Because
this is binary classification: the output is either going to be 1 or 0, or in our prediction case
^y, which lies between 0 and 1. Let us split all of this into two parts. We have our
summation and then we have our sigmoid, and this gives us ^y. At the end of the summation,
whatever we get is z, and at the end of the sigmoid, whatever we get is ^y.

Now, the weights which we do not know are w_0, w_1, w_2, w_3, and so on and so forth up
to w_n. Remember, just like in linear regression, you could choose features such as x_1, x_2, x_1^2, x_1 x_2,
et cetera; so you can choose non-linear features as well. This is just a
reminder. So what is missing in this picture? We do not know these w's. This is just the forward
model, and we follow our usual procedure: you have the forward model, you guess the w's,
you get a ^y, from ^y you get a J, and from J you feed back using ∂J/∂w.

Through gradient descent, or some version of gradient descent, we calculate all that and improve
our guess for the w's. So what is missing here is of course ∂J/∂w. I am going to write
the whole thing in vector form. You have the vector x; it runs through a summation using
w, then through a non-linearity, the sigmoid, and you get ^y. And this ^y is what gives you
J. Just for simplicity and clarity, I am going to represent this slightly differently.

Let us take the vector x: I run it through w and get z; I run z through the non-
linearity and get ^y; from ^y I get J. And what I want is ∂J/∂w. So the question is,
what is the expression for ∂J/∂w? This is fairly straightforward. Let us trace it
back:

∂J/∂w = (∂J/∂^y)(∂^y/∂z)(∂z/∂w).

This is the chain of dependency: J changes because w changes. Why? Because J changes due to
changes in ^y, ^y changes due to changes in z, and z changes due to changes in w.

Put another way: when I perturb w, it perturbs z, which perturbs ^y, which perturbs J. So
that is what we are chasing down here; this is essentially the chain rule. When we come to
neural networks, you will see that exactly the same idea is applied in what is called the
back propagation algorithm. So let us now calculate the expression for this, looking at each
of the terms individually. First, what was J? J, if you recall, I had written as the (negative)
summation of terms of the form y ln ^y + (1 − y) ln(1 − ^y).

I am going to index these terms by i, with i going from 1 to m. And what is this i = 1 to m? It is
the same thing we did when we were doing linear
regression. Recall, suppose you have the vector x, as in the example shown here, and
you have multiple examples or data points. For the first data point, x was (x_1, x_2, ..., x_n),
where you have n features, and I put a superscript (1) to say that this is
the first data point.

In the case of the OR example, we had four such data points. So you had (x_1^(1), x_2^(1)), which
was 00 in that case. Similarly you have (x_1^(2), x_2^(2), ..., x_n^(2)), and we could have m such examples,
very similar to the multiple linear regression example that we did. Now we have the ground truth,
and since this is binary classification, the ground truth y^(1) will be either 0 or 1, y^(2) will
be either 0 or 1, and so on up to y^(m). Then we have our prediction ^y, which is h of
x given the parameters w.

So you are going to have ^y^(1), ^y^(2), up to ^y^(m). Finally, I am going to introduce
something new just for clarity: J_i, the loss due to the i-th
example alone. Going back here, suppose you had four points and this value of y was 1,
while ^y was, say, 0.8. The very fact that ^y and y differ gives you some
amount of loss. So J_i is just the loss on the i-th example, and you will have J_1, J_2, up to J_m.

Even in linear regression, if you have a line and multiple points, the loss for
each of those points can be denoted J_i. So coming back here, all I am writing is that J is the
sum of the binary cross entropy losses from each individual data point.

(Refer Slide Time: 9:31)

So it is simply the sum of all the individual losses; let us include the minus sign within each
J_i. Then ∂J/∂w is the sum of the terms ∂J_i/∂w, and for each term we have

∂J_i/∂w = (∂J_i/∂^y)(∂^y/∂z)(∂z/∂w).

Technically speaking, this ^y is ^y^(i); for clarity of notation I am going to drop the i in the
expressions that follow, and we will sum the terms back at the end. Let us look at the first term.
With the i dropped, J = −[y ln ^y + (1 − y) ln(1 − ^y)].
Notice that the derivative is with respect to ^y, because that is what depends on w. y is fixed;
the y's are the labels that we are given. ^y is the hypothesis
function that we get out of the prediction using our parameters w.

So what is ∂J/∂^y? This term is ∂J/∂^y = −y/^y + (1 − y)/(1 − ^y).
So we have that expression. What about ∂^y/∂z?
Remember, ^y is simply σ(z); that is how we calculated it.

(Refer Slide Time: 11:46)

If you come back to this picture, ^y was calculated as g(z), that is, the sigmoid of z.

(Refer Slide Time: 11:55)

Now, luckily, in a previous video we already calculated this derivative: it is simply σ(z)(1 − σ(z)),
which is the same as ^y(1 − ^y). So if you put it
together, you get

(∂J/∂^y)(∂^y/∂z) = [−y/^y + (1 − y)/(1 − ^y)] × ^y(1 − ^y).

(Refer Slide Time: 12:37)

And if you simplify this, you get a very simple expression, which is −(y − ^y).
Please check this for yourself. By the chain rule we can call this expression
∂J/∂z: that is, ∂J/∂z = −(y − ^y). This is simply the negative of the error. That is,
if you made a prediction ^y and the ground truth was y, then ∂J/∂z = −(y − ^y) = ^y − y.

(Refer Slide Time: 13:16)

We have one other term left in this expression, ∂z/∂w. Remember z = w · x, where the vector w
includes w_0: we saw this with the augmented w
and the augmented x, where x now includes 1, x_1, up to x_n. So z can be written as wᵀx, and
also as xᵀw.

(Refer Slide Time: 14:08)

We saw in the linear algebra videos that when z = xᵀw, the derivative ∂z/∂w is
simply the vector x. You can verify this by componentwise algebra, or by using the notation I used in week 1 of this course. So if you
put these two together, you get

∂J/∂w = (∂J/∂z)(∂z/∂w) = −(y − ^y) x,

which is minus the error multiplied by the
input. Now there are several noteworthy things here; I will go over them one by one.

(Refer Slide Time: 15:13)

First, the general expression: what we derived is ∂J_i/∂w. So if I look at ∂J/∂w, this is
going to be

∂J/∂w = −Σ_{i=1}^{m} (y^(i) − ^y^(i)) x^(i).

For example, if you want ∂J/∂w_0, you get −Σ (y^(i) − ^y^(i)) x_0^(i), where x_0 is
simply 1, and so on and so forth: for each component of w, you take the corresponding component
of x. This side is a vector, and that side is also a vector. So this expression is for logistic regression with binary
cross entropy. Now, some of you might recall that we had exactly the same expression, except for
a factor of 1/m, which we can either include or not, for linear regression
with least squares.

Please refer back to your notes and check whether this is true. This is actually
remarkable: you get the same expression for the gradient of logistic regression with binary
cross entropy as you get for linear regression with least squares. This remains true even when you
start including regularization: if you include a regularization term λ‖w‖², as Dr Ganapathy had
shown earlier, it contributes to the gradient in the usual way for logistic regression as well.

The same thing holds true even for neural networks. But although the expression is exactly the
same, this does not mean that J is the same. Remember that ^y means different things
for linear regression and for logistic regression: for linear regression ^y was simply
wᵀx, while for logistic regression ^y is σ(wᵀx). So please do
remember that.
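Before moving on, here is a minimal sketch of this gradient computation (Python with NumPy; the function and variable names are my own, and this is only an illustration of the formula above, not the course's own code):

import numpy as np

def grad_binary_cross_entropy(w, X, y):
    """dJ/dw = -sum_i (y_i - yhat_i) x_i, with X of shape (m, n+1)
    already augmented with a leading column of ones for the bias."""
    y_hat = 1.0 / (1.0 + np.exp(-X @ w))  # sigmoid(w.x) for all m examples
    return X.T @ (y_hat - y)              # shape (n+1,)

# OR-gate data, augmented with the bias column x0 = 1
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
y = np.array([0, 1, 1, 1], dtype=float)
print(grad_binary_cross_entropy(np.zeros(3), X, y))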

(Refer Slide Time: 17:48)

The second noteworthy thing, which is not obvious, is that for logistic regression the loss
function J is not convex. I will not go into what a convex set means, but recall that when
we had least squares, the cost function generally looked simply like a paraboloid. That is
what is called a convex function, where you have only one minimum. The most important thing for us
to know here is that the minimum is in general not unique. We can understand this
intuitively too.

(Refer Slide Time: 18:48)

For example, consider the set we were trying to classify. When we were doing the OR gate, we
chose one particular line, but there is no reason to choose only this line: this other line also
functions as a classifying line, and so does this one. So there are many possible
classifying lines, and logistic regression does not give unique solutions. In our whole process,
depending on which w you start with (remember that we guess our initial set of weights
randomly), you might get one classifying line or
another.

So which classifying line you get from logistic regression actually depends on your initial conditions;
that is an important constraint. This is also a problem with neural networks: they
do not necessarily give the global optimum. In gradient descent, say you
have two local minima: at one place your gradient is 0, and at another place your gradient is also 0.
If you have something of this sort, it is quite possible that your gradient descent
reaches the first one, gets stuck and does not move from there, rather than reaching the other one, which would
actually be the global optimum.

This is true whether it is logistic regression or neural networks, which we will see
shortly in future videos. But for linear regression, typically, if you converge, you will converge
to the global optimum, because there is only one optimum: the cost is a convex function. So
it is important for you to remember that logistic regression does not necessarily deal with
convex loss functions.

(Refer Slide Time: 21:15)

So, in order to solidify your understanding of the logistic regression process, let us now
look at some code which does logistic regression. You will see that this is just a small modification
of the linear regression code that we had already used earlier; we had written a generalized
linear regression code.

(Refer Slide Time: 21:40)

So since the gradient expression for logistic regression is practically indistinguishable from that
of linear regression we will use the same code in order to do logistic regression. I will take the
simple example of the OR gate.

(Refer Slide Time: 21:50)

In this case we start once again with our example, treating it as four data
points, with x = (0,0), (0,1), (1,0), (1,1) and ground truth y = 0, 1, 1, 1,
and the same model as before: a summation followed by a sigmoid, which takes in x and gives us ^y.
What I will also do, both in the code (as you will notice) and in the
expressions here, is write everything as vector expressions, just to show you how that
works.

So if you have x_1, x_2, and x_0, which is just 1, we sum these three with weights w_0, w_1, w_2.
What comes out after applying the sigmoid g is ^y. You can write the
intermediate output as z = w · x; remember this w also includes the bias term w_0,
and x includes x_0. Then ^y = g(z). This is the
vectorized expression, and it is what lets us write general code with great ease,
because suppose I decide to use other non-linear features.

For example, in the XOR gate case, suppose you want to
include x_1^2, x_2^2, et cetera: you could simply include them as extra features x_3, x_4, and
logistic regression would still work as is.

(Refer Slide Time: 24:08)

So we will look at vectorized code for this in the next video. Thank
you.

Machine Learning for Engineering and Science Applications
Professor Dr. Balaji Srinivasan
Department of Mechanical Engineering
Indian Institute of Technology, Madras
Generalized Function for Logistic Regression

(Refer Slide Time: 0:25)

Welcome back. We will be doing a repeat of the linear regression example, except with the
logistic regression case and a slightly different example. What I have written here is a very
short code for the OR gate. Notice the OR gate is now given as four different examples:
00, 01, 10, 11, and I have given the corresponding ground truths here; y is simply the
ground truth: for 00 it is 0, for 01 it is 1, for 10 it is 1, and for 11 it is also 1. And like we did in linear
regression, I have given a reasonably high learning rate of α
= 1. And for ε, the stopping criterion, let us make it a little smaller, maybe
10^−3, just to see how the whole thing proceeds. This is obviously not
good enough in general. So we will see how it works.

(Refer Slide Time: 1:16)

Now I will show you the code with the comments. But the one I will be running is here, the one
without the comments.

(Refer Slide Time: 1:19)

So the general logistic regression code you will notice (this is there in the week 5 material on the
NPTEL website, you can copy it from there), it is more or less identical to the linear regression
code.

(Refer Slide Time: 1:32)

So it is remarkably identical, you will see this term is the same, this term is the same.

(Refer Slide Time: 1:36)

The only place we start differing is here. Earlier, in linear regression, we had only the term
w_0 + ... + w_n x_n, which is simply w · x. (There is a small typo here; please ignore it.) So we have
σ(w · x), which is sitting here.

(Refer Slide Time: 1:52)

I have defined an inline function for sigmoid, and the only thing that changes from linear
regression is that instead of having just z, which was the output for linear regression, you have σ(z) as
the new hypothesis function.

(Refer Slide Time: 2:09)

The other change is in calculating the loss function. The loss function is now the binary cross entropy
loss function.

(Refer Slide Time: 2:22)

Otherwise the error terms et cetera remain exactly the same. In fact even the gradient terms
remain the same. You may include or you may remove the 1/m factor.

(Refer Slide Time: 2:33)

I have chosen to include it just so that I could use the same code; this is just a lazy man's solution.
If you wish, you can remove the 1/m.

(Refer Slide Time: 2:42)

Otherwise the gradient term is calculated in exactly the same way and finally we calculate the
loss once again based on the sigmoid.

(Refer Slide Time: 2:52)

So let us run this code for the OR gate and see what it gives. You will notice these four points;
these are our original data points, and the classifying line moves slowly up from an incorrect
classification. Technically speaking, at some point the data is already correctly classified, but we have
set a certain stopping criterion, so the line keeps moving until that criterion is met. Notice also that the
loss function J is continuously decreasing; please notice this. You will also notice that as the loss
function decreases, you can hardly see any change in the classifying line.

You will also see that the classifying line is not quite the one we had derived theoretically,
which passed through 0.5. As I mentioned in the previous video, this depends on what
your initial conditions are: if your initial conditions are different, you might get a slightly
different final classifying line. This is obviously a very
simple exercise with the OR gate; you can try it with the AND gate if you wish, and you will get a
slightly different classifying line.

So if you try different initial conditions, you will get different classifying lines. The
advantage of logistic regression is that it will give you an answer. Of course, it will give you some
local minimum, which might be good or might be bad. With four data points it turns out
to be reasonably good. Now if I wanted to do the AND gate example, all I
would need to do is change the ground truths: change this to 0, this to 0 and this to 1. Please play with this,
which I recommend: change your data points and check what kind of
gate it classifies. In fact, I would even recommend that you try the XOR gate, which is
simply one more change to 0, and see what it does.

You will see that in the case of the XOR gate, J saturates very early; it does not keep
on decreasing, and you get an incorrect classification, because, as we had discussed in the XOR
video, XOR is not linearly separable data, while the OR gate happens to be
linearly separable. So this is a simple code. I would encourage you to play with it, look through
the code, compare it with the linear regression code, and see what you notice
overall.
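The actual course code is available in the week 5 material on the NPTEL website; as a stand-in, here is a minimal sketch (Python with NumPy, with my own variable names) of the same gradient-descent loop for the OR gate, with the 1/m factor included as discussed:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)  # bias column first
y = np.array([0, 1, 1, 1], dtype=float)   # OR ground truth
w = np.random.randn(3) * 0.1              # random initial weights
alpha, eps = 1.0, 1e-3                    # learning rate and stopping criterion

for step in range(10000):
    y_hat = sigmoid(X @ w)
    grad = X.T @ (y_hat - y) / len(y)     # same form as the linear regression gradient
    w -= alpha * grad
    if np.linalg.norm(grad) < eps:        # stop once the gradient is small
        break

print(w, (y_hat > 0.5).astype(int))       # learned weights; classification 0, 1, 1, 1

Because the initial weights are random, rerunning this gives slightly different final lines, exactly the non-uniqueness discussed above. Thank you.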

Machine Learning for Engineering and Science Applications
Professor Doctor Balaji Srinivasan
Department of Mechanical Engineering
Indian Institute of Technology Madras
Multinomial Classification Introduction

(Refer Slide Time: 00:14)

In this video we will be extending our binary classification, which we did using logistic
regression, to the general problem of multinomial classification. Multinomial classification
is simply the case when you have more than 2 classes.

For example, if you have a case where you want to classify digits, Ok. We will see

(Refer Slide Time: 00:53)

examples of this later. So let us say you have all your digits from 0 through 9. So this is a 10
class classification problem.

Somebody has handwritten digits and we have to find out which digit it is. That would be a
10-class problem, so k = 10 in such a case, Ok.

(Refer Slide Time: 01:12)

So when we try to solve such problems, there are primarily a few things that we have to do
over and above what we did for the binary classification problem, Ok. So
what we need is the following.

(Refer Slide Time: 01:33)

First, we need to know how to represent the output class.

(Refer Slide Time: 01:52)

For example in binary classification we simply decided if it belonged to one class we would
label it as 0, if it belonged to another class we would label it as 1.

Let us say we are dealing with a case where we are trying to
label images, and the 3 classes are horses, cats and dogs.

(Refer Slide Time: 02:15)

You have several choices. Obviously you cannot label it simply as words, as we have
discussed in several videos before. You need to give a numerical name.

So one choice is to simply call these class 0, class 1, class 2. That is one possibility. We will look at
one other solution, which is called the One Hot Vector.

(Refer Slide Time: 02:38)

So we will be looking at this. The second problem is what happens in the final layer?

(Refer Slide Time: 03:02)

So recall that in your binary classification task we used the sigmoid, because it neatly gave us
a number between 0 and 1. And if it was close to 0, we knew it belonged to class 0 and if it
was close to 1 we knew that it belonged to class 1.

Now what do we do in the multinomial classification case? We will be looking at a function
called the Softmax function, which actually corresponds very well with the One Hot Vector.

(Refer Slide Time: 03:27)

So we will use that for multinomial classification.

Finally, we need to answer how we are going to calculate the loss or cost function for this
case. We looked at the binary cross entropy as the cost function for binary classification.

(Refer Slide Time: 03:44)

It turns out that we will use something very similar for multinomial
classification. So in the videos that follow, we look at the One Hot Vector, we look at Softmax,
and we look at the loss function for multinomial classification.

Machine Learning for Engineering and Science Applications
Professor Doctor Balaji Srinivasan
Department of Mechanical Engineering
Indian Institute of Technology Madras
Multinomial Classification One Hot Vector

(Refer Slide Time: 00:14)

In this video we will be looking at the One Hot Vector representation. Remember that the
One Hot Vector representation is used for a case where you have more than 2 classes. You
can also use it for the case when you have just 2 classes but it is generally overkill to use a
One Hot Vector.

(Refer Slide Time: 00:36)

Let us start with a binary classification task and then we will go to a multinomial
classification task and see how a One Hot Vector works. So let us say you have 2 classes, you
have an image, and this image is that of a cat or a dog.

(Refer Slide Time: 00:55)

And I want to represent the output. The output says either cat or it says dog.

(Refer Slide Time: 01:03)

(Refer Slide Time: 01:07)

So earlier we would simply call cat as class 1 and dog as class 0 or vice versa. Either way it is
a simple number. Now in order to generalize for the multiple class cases we are going to
represent our output instead as a vector.

So the vector is going to have 2 elements, Ok. So in the ideal case, instead of calling a cat
simply 1, I would call it (1, 0);

(Refer Slide Time: 01:38)

this would represent a cat. And I could call a dog (0, 1), which would represent

(Refer Slide Time: 01:50)

a dog. Essentially, the first element represents the cat;

(Refer Slide Time: 01:58)

the second element represents the dog, Ok.

Now how does this help? Let us go one more step. Let us say we have 3 classes: cats,
dogs and horses. Once again, the One Hot Vector then looks like (0, 0, 1) for a
horse,

(Refer Slide Time: 02:29)

(0, 1, 0) for a dog (this is of course arbitrary; you just have to make consistent choices), and (1, 0, 0) for
a cat.

(Refer Slide Time: 02:38)

(Refer Slide Time: 02:46)

In general, if our prediction is ^y, then ^y is going to have k elements: ^y_1, ^y_2, ..., ^y_k.

(Refer Slide Time: 03:07)

So k was 3 in our cats dogs horses example. What is the meaning of each of these terms? Let
us think about this carefully.

So in my cats-dogs-horses example, this first element is the probability that my output belongs to class 1, Ok.

(Refer Slide Time: 03:41)

Now my ideal output would be simply (1, 0, 0), (0, 1, 0) or (0, 0, 1), which is why it is called One Hot:
only one element is 1, and all the others are 0.

(Refer Slide Time: 03:52)

That is why it is called One Hot Vector. All the others effectively are cold.

(Refer Slide Time: 04:01)

However, we know through logistic regression that we are not going to get precisely 0 or 1;
we will get some number between 0 and 1.

(Refer Slide Time: 04:15)

Our actual prediction will typically look something like (0.9, 0.01, 0.09),
something of that sort.

If I have a vector of this sort, it would represent to us that there is

(Refer Slide Time: 04:37)

a probability of 0.9 that it is a cat, a probability of 0.01 that it is a dog, and a
probability of 0.09 that it is a horse, Ok. So each element here, the element ^y_k, represents a
probability, Ok.

I will first write it in words and then in more mathematical terms: ^y_k is the probability that the output belongs to
class k.

(Refer Slide Time: 05:12)

More precisely, this is the probability that y_k = 1 given whatever our input is, Ok. x here
denotes the image in this example, and in general it denotes whatever input we have
here.

(Refer Slide Time: 05:38)

So this is the One Hot Vector and as you can see it can generalize from a binary to a k class
problem. We will be using this for multinomial classification.

Machine Learning for Engineering and Science Applications
Professor Doctor Balaji Srinivasan
Department of Mechanical Engineering
Indian Institute of Technology Madras
Multinomial Classification Softmax

(Refer Slide Time: 00:13)

Welcome back. We will now look at some further details of the multinomial logistic regression,
or multinomial classification, algorithm. Remember that multinomial logistic
regression deals with the case when you have k > 2 classes.

(Refer Slide Time: 00:31)

So remember that in the last videos we talked about the different
things that we need to do

(Refer Slide Time: 00:40)

in order to establish our deep learning model.

The first thing is

(Refer Slide Time: 00:45)

the representation of ^y. This we can do with the One Hot Vector.

(Refer Slide Time: 00:55)

So in the case k = 3, ^y will have 3 numbers; suppose it is something like (0.75, 0.1,
0.15),

(Refer Slide Time: 01:09)

and y itself could be (1, 0, 0) or (0, 1, 0) or (0, 0, 1).

(Refer Slide Time: 01:16)

So you have something of this sort which represents ^y . Now what we need to do next is to
find out what is the nonlinearity that will achieve the classification.

(Refer Slide Time: 01:50)

So let me briefly point out why this is important.

So let us say you have some input x. For the sake of this example let us say x vector is an
image. Let us say it is a 60 × 60 gray scale image,

(Refer Slide Time: 02:16)

which means x (as I have repeated many times) can be written simply as a single
unrolled vector which goes from x_1 to x_3600.

(Refer Slide Time: 02:29)

So we will just represent this as x 1 up to x n.

(Refer Slide Time: 02:38)

So all these circles represent different components of this vector, Ok. Now
I am going to do the same thing that we did before: x goes through a Σ,

(Refer Slide Time: 02:55)

a summation, and we want to classify this grayscale image as one of three classes.

Let us say I will take the same example I have done several times, or I have talked about
several times. Let us say this is an image which we know is either of a cat or of a dog or of a
horse. You can think of

(Refer Slide Time: 03:15)

several engineering examples also.

But let us say we use these because they are immediately clear to us. Ok, if we have to
do this, we need a ^y here,

(Refer Slide Time: 03:26)

and ^y now is going to have three components;

(Refer Slide Time: 03:33)

^y_1, ^y_2, ^y_3, as I have shown above, Ok.

Ideally you would like only one of these to be 1, but as we have discussed several
times, what you are going to get is actually some number between 0 and 1 for all three.

Now what is the property that you would like ^y to satisfy? I had already discussed before that
each of these is a probability.

So if I get something of this sort I will say that the probability that this image is a cat is 0.75,
the probability that it is a horse is 0.1 and the probability that it is a dog is 0.15.

(Refer Slide Time: 04:13)

That is the way I would like to interpret my ^y , Ok.

So if I want to interpret it that way, then what do I need? I need that Σ_{k=1}^{K} ^y_k = 1, Ok. This is

required in case

(Refer Slide Time: 04:38)

I want to interpret it in this way like a One Hot Vector. There are other ways of doing it but
this is the one that we will stick to for this course, Ok.

This is what we would like to do, Ok. Obviously it also means that each ^y_k should
lie between 0 and 1. We do not want them

(Refer Slide Time: 05:00)

to be either negative or even greater than 1. That is what we would like to achieve. These two
conditions we would like ^y to satisfy.

Now remember that before it goes to these 3 outputs you have a w, w matrix actually. What
is the size of this matrix? So suppose I ignore the bias term,

(Refer Slide Time: 05:28)

Ok, for now suppose I ignore this constant term; then you will see that w has
to take in x and give out ^y.

(Refer Slide Time: 05:40)

x is of size 3600×1, and ^y is of size 3×1.

(Refer Slide Time: 05:48)

So what can you do in order to take this 3600×1 vector to a 3×1 vector? You need a weight matrix;
of what size? This is going to be of the size

(Refer Slide Time: 06:01)

3×3600. Why? Because then w (3×3600) multiplying x (3600×1) will give you ^y (3×1), Ok.

So let us put that in here.

(Refer Slide Time: 06:21)

All these get together through w, there is a summation

(Refer Slide Time: 06:31)

that gives you ^y . Would this be sufficient?

Obviously not, because if I take some general weight matrix and simply multiply x by it,
there is no guarantee that these two conditions would be satisfied. This is the same problem
that we faced while doing logistic regression:

that is, w·x is of the right size, but I am not sure that simply applying a linear
combination is going to give me numbers between 0 and 1. This is why we use a
squeezing function, just like we did in logistic regression.

In logistic regression, the simple squeezing function we used was the
sigmoid, and the sigmoid gave us a number between 0 and 1.

(Refer Slide Time: 07:21)

Now we could think why not do the same thing here? Ok.

So I have w·x; if I apply the sigmoid, σ(w·x), this will also give me a 3×1

(Refer Slide Time: 07:34)

vector, and each of these numbers will be between 0 and 1. So notice the operation I am doing: I
find z = w·x, then I take σ(z), which is also a 3×1 vector.

Each of these numbers will be between

(Refer Slide Time: 07:56)

0 and 1. Now why not use that? There is one small problem: the sum-to-1 condition will not be
satisfied.

(Refer Slide Time: 08:10)

If you apply the sigmoid to 3 arbitrary numbers, you are not
certain that the sum of the results will be 1.

So what do we do? We use a simple function called the Softmax function.

(Refer Slide Time: 08:35)

So the Softmax function works in a very simple way:

softmax(z_i) = e^(z_i) / Σ_{j=1}^{k} e^(z_j).

(Refer Slide Time: 08:47)

(Refer Slide Time: 09:09)

So it is simply normalizing the exponentials of all these components, Ok.

So let me show this in a simple way. Suppose you have x (3600×1); you apply w (3600×3) and you get
z (3×1). Remember, wᵀx becomes 3×1.

(Refer Slide Time: 09:34)

And now you have 3 numbers, z_1, z_2, z_3.

Our problem

(Refer Slide Time: 09:43)

of course is that z_1, z_2, z_3 are not between 0 and 1. So what do we do? We say ^y is equal to
the Softmax of these 3,

(Refer Slide Time: 09:56)

which is the same as (softmax(z_1), softmax(z_2), softmax(z_3)).

(Refer Slide Time: 10:14)

What does this do? This is equal to

softmax(z_1) = e^(z_1) / (e^(z_1) + e^(z_2) + e^(z_3)),
softmax(z_2) = e^(z_2) / (e^(z_1) + e^(z_2) + e^(z_3)),
softmax(z_3) = e^(z_3) / (e^(z_1) + e^(z_2) + e^(z_3)).
softmax ( z3 )= ( z1 ) ( z 2) ( z3 )
e +e +e
(Refer Slide Time: 10:28)

(Refer Slide Time: 10:36)

You will notice automatically that both our conditions are satisfied, Ok, because
e^(z_1) / (e^(z_1) + e^(z_2) + e^(z_3)) is always going to be between 0 and 1,
since the exponentials are positive functions.

Another thing is the sum of these 3 should be 1.

(Refer Slide Time: 10:55)

So we also get the condition that Σ_{k=1}^{3} ^y_k = 1.

(Refer Slide Time: 11:05)

So both our conditions are satisfied. Some of you might recall from week 3 that
you have to be a little careful with the practical computation of Softmax.
If you compute the

(Refer Slide Time: 11:18)

numerator and the denominator separately as I have shown here, sometimes you might run
into overflow problems. We had also looked at a solution to that within week 3 itself. So I
would ask you to look at that in case you have forgotten it.
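As a reminder of that week 3 discussion, here is a minimal numerically stable Softmax sketch (Python with NumPy, with my own function name); subtracting max(z) before exponentiating multiplies numerator and denominator by the same constant, so the result is unchanged:

import numpy as np

def softmax(z):
    # shift by max(z) to avoid overflow in exp; the softmax value is unchanged
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([1000.0, 1001.0, 1002.0])  # naive exp(z) would overflow here
print(softmax(z), softmax(z).sum())     # well-behaved probabilities summing to 1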

So, to recapitulate what we have done in this video: it is a very simple idea. In case you have
a One Hot Vector

(Refer Slide Time: 11:38)

as a classification representation of your final output, all you need to do in the final layer or
in the layer after the linear combination is to add a Softmax, Ok.

(Refer Slide Time: 11:52)

So once you add that Softmax you get a proper classification and this is your forward model
for

(Refer Slide Time: 11:59)

the multinomial logistic regression case. So recall that we had looked at 2 things. The binary
logistic regression, in this case you have 2 classes,

(Refer Slide Time: 12:20)

and for ^y it is typically easier to just represent it as a scalar, a 0 or a 1,

(Refer Slide Time: 12:29)

And we have our binary cross entropy loss function, which was −[y ln(^y) + (1 − y) ln(1 − ^y)].

(Refer Slide Time: 12:45)

And then you have multinomial logistic regression where k is greater than 2. ^y now is a One
Hot Vector

(Refer Slide Time: 13:05)

and the nonlinearity we use here is Softmax. The nonlinearity we used in binary logistic
regression was the sigmoid.

(Refer Slide Time: 13:17)

Now what do we do about J?

(Refer Slide Time: 13:32)

So that is the last problem that we have to solve here. As it turns out, this is also fairly
straightforward. I will write it down right now.

(Refer Slide Time: 13:43)

The cost function for the multinomial case is J = −Σ_{k=1}^{K} y_k ln(^y_k), for k = 1, ..., K classes. This is it.

(Refer Slide Time: 14:00)

Now you might wonder what happened to the (1 − y) ln(1 − ^y) term: why does this
look slightly different? This is also a cross entropy loss function, for k greater
than 2. Now what happens at k = 2?

(Refer Slide Time: 14:36)

(Refer Slide Time: 14:43)

I want to show you that the binary cross entropy loss function is actually equivalent to
this in the case k = 2. So let us say you have ^y in the case k = 2 and we represent it as a
One Hot Vector with components ^y_1 and ^y_2; similarly y has components y_1 and y_2.

(Refer Slide Time: 14:55)

(Refer Slide Time: 15:06)

(Refer Slide Time: 15:12)

Now, if it is a binary problem, it is either one class or the other. Therefore, let me
put it this way: ^y_2 has to be equal to 1 − ^y_1.

(Refer Slide Time: 15:26)

Similarly, y_2 = 1 − y_1.

(Refer Slide Time: 15:29)

So if we run it through this formula we get J = −Σ_{k=1}^{2} y_k ln(^y_k), which simply becomes

J = −y_1 ln(^y_1) − y_2 ln(^y_2),

(Refer Slide Time: 15:56)

and from the two relations above this is simply J = −[y_1 ln(^y_1) + (1 − y_1) ln(1 − ^y_1)],

(Refer Slide Time: 16:09)

which is the same as the binary cross entropy loss function.

(Refer Slide Time: 16:18)

So this is just to say that this is a general formula. You can think of all

(Refer Slide Time: 16:27)

classification loss functions, or at least the cross entropy loss functions, in this
form.
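As a small illustration of this general formula (a Python sketch of my own; the epsilon inside the log is just a numerical guard I have added, not part of the lecture's formula):

import numpy as np

def cross_entropy(y, y_hat, eps=1e-12):
    # J = -sum_k y_k ln(yhat_k); y is one-hot, y_hat a probability vector
    return -np.sum(y * np.log(y_hat + eps))

y = np.array([1, 0, 0])               # one-hot ground truth: class 1
y_hat = np.array([0.75, 0.10, 0.15])  # predicted probabilities
print(cross_entropy(y, y_hat))        # about 0.288, i.e. -ln(0.75)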

To summarize, so far we have looked at the forward model and the loss function for logistic
regression as well as for the multinomial logistic regression.

In both cases, all we have is a linear function followed by a nonlinearity. When you repeat
the

(Refer Slide Time: 16:59)

same thing multiple times you essentially get a deep neural network as we will see in the
following videos. Thank you.

Machine Learning for Engineering and Science Applications
Professor Doctor Balaji Srinivasan
Department of Mechanical Engineering
Indian Institute of Technology Madras
Schematic of Multinomial logistic regression

(Refer Slide Time: 00:12)

Welcome back. In the previous videos we have seen how to use logistic regression for
multiclass problems. We had done that using a Softmax function if you remember. We had
also looked at what the corresponding

(Refer Slide Time: 00:34)

loss function was, etc.

In this video I want to show you a simple schematic which will also tell you how exactly a
weight matrix arises when you deal with multiple classes in multinomial logistic
regression.

So let us consider a simple example. Let us say I have 3 input features.

(Refer Slide Time: 01:07)

And let us say I have 3 output features also, Ok, ^y 1, ^y 2, ^y 3.

(Refer Slide Time: 01:22)

So let us say this is a 3 class classification problem. You can think of multiple examples for
this.

For example, given height, weight and age, suppose you want to find out whether a
person has a low, medium or high probability of heart disease. This is not quite a
classification problem, but I give it just as an example.

We will look at several examples or at least a few examples in the examples week which will
be around week 9 or so. So you can think of any convenient example for yourself. And now
let us introduce our usual bias unit which is 1 or x 0,

(Refer Slide Time: 02:04)

and now we want to find out what is ^y 1 , ^y 2 , ^y 3.

So the portion that we are doing right now is the forward model.

(Refer Slide Time: 02:19)

So, as usual, ^y_1 = softmax(w_0 + w_1 x_1 + w_2 x_2 + w_3 x_3).

(Refer Slide Time: 02:40)

That would be ^y 1. Now suppose I have ^y 2.

Now ^y_2 is also the softmax of some linear combination, let us say

^y_2 = softmax(w_0 + w_1 x_1 + w_2 x_2 + w_3 x_3).

(Refer Slide Time: 03:03)

Now suppose w_0, w_1, w_2, w_3 were the same in both these cases; obviously you are going
to get the same ^y_1 and ^y_2, because the functions would be identical.

So this is not a good idea. So you need different weights. So we are going to use different
weights here, Ok.

(Refer Slide Time: 03:31)

So we need some terminology in order to distinguish these two sets of weights. I will call the first set
w_01, w_11, w_21, w_31, where the 1 stands for the output and the 0, 1, 2, 3 stand for the
input.

Similarly, you can easily see that the second set should be w_02, w_12, w_22, w_32.

(Refer Slide Time: 03:56)

Finally if we come here, I need another set of weights.

(Refer Slide Time: 04:08)

So ^y_3 would be ^y_3 = softmax(w_03 + w_13 x_1 + w_23 x_2 + w_33 x_3).

(Refer Slide Time: 04:32)

So how many weights do we have? There are 4 unique weights feeding each of these outputs, so you have 4 times
3,

(Refer Slide Time: 04:43)

12 weights in total, which also accounts for the bias terms.

So how would we write this matrix-wise? We have x (3×1), we have ^y (3×1), and we have w,
which is now a weight matrix.

You can see w has 2 indices.

(Refer Slide Time: 05:22)

In w_ij, i is the input feature and j is the output feature. You also saw in the earlier video
with XOR that you can have more than one layer. In that case, typically

(Refer Slide Time: 05:33)

we add a superscript l here, which denotes the layer. So you could have w_ij^(1), w_ij^(2), w_ij^(3), etc.

So you will have many weights. This grows into a large number of weights: as you will see in
next week's videos when we come to convolutional neural networks, practical neural
networks today have millions and millions of parameters, which
is why they are extremely powerful, Ok.

Coming back to this, if we want to write ^y as w times x,

(Refer Slide Time: 06:12)

so just for this case I will make x 4× 1 so that

(Refer Slide Time: 06:21)

the 4 includes our bias unit also.

So you can write ^y = w x, where x is 4×1 and ^y is 3×1. If you want an appropriate w,
please imagine what size it should be: it should be 3×4.

(Refer Slide Time: 06:43)

In the general case, if ^y is k ×1 where k is the number of classes,

(Refer Slide Time: 06:56)

and x is n ×1 where n is the number of features, then w should have the size k ×n, Ok.

(Refer Slide Time: 07:05)

Now, some people will denote this w as wᵀ, so as to be consistent with the
notation I have used here.

(Refer Slide Time: 07:16)

So you might see this in multiple places: sometimes you will see wx, sometimes you will see
wᵀx, and sometimes you will also see wᵀx + b, where b is the vector of biases.

(Refer Slide Time: 07:38)

So just to clarify this notation: notice that if we separate out the bias, this weight
becomes b_1, this becomes b_2 and this becomes b_3.

(Refer Slide Time: 07:50)

The b vector and the wx term are kept separate in such notations.

(Refer Slide Time: 07:53)

So we will be using these kinds of notation interchangeably, as I said before, especially
in future videos and future weeks.
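To tie this schematic together, here is a minimal sketch (Python with NumPy; the weight values are random placeholders of my own, not trained values) of the forward model ^y = softmax(wx + b) for n = 3 features and k = 3 classes:

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))    # stabilized as discussed earlier
    return e / e.sum()

n, k = 3, 3                      # input features, output classes
rng = np.random.default_rng(0)
W = rng.normal(size=(k, n))      # k x n weight matrix
b = rng.normal(size=k)           # bias vector, separated out as b

x = np.array([1.7, 65.0, 40.0])  # e.g. height, weight, age
y_hat = softmax(W @ x + b)       # k-element vector of class probabilities
print(y_hat, y_hat.sum())        # probabilities summing to 1

Thank you.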

Machine Learning for Engineering and Science Applications
Professor Doctor Balaji Srinivasan
Department of Mechanical Engineering
Indian Institute of Technology Madras
Biological Neuron

(Refer Slide Time: 00:13)

Welcome back. In this video we will be looking at the basic idea of a biological neuron. The
reason for talking about this is historical.

As we saw in the last videos, when we looked at the XOR gate,

(Refer Slide Time: 00:31)

we required one extra layer between the input and the output in order to actually represent the
XOR function accurately.

Now as it turns out, even before the XOR gate example, a lot of people had the idea that if
they could somehow replicate the functioning of the human brain, or at least of the neurons in
the brain, then they could somehow replicate all of human thinking.

Now I will discuss the biological neuron only very, very briefly, but I want to point
out that this analogy, in my opinion at least, and in the opinion of Doctor Ganapathy also,
is not quite sound. Several people say this: neuroscience is a completely different area, and even
computational neuroscience is very, very different from what we do in actual neural networks,
which we will be seeing in the next few videos.

Computational neuroscience works on extremely complex phenomena, and they try to replicate what we will be showing here

(Refer Slide Time: 01:31)

far more accurately than what we will be doing. When we show the artificial neuron which
we will do in the next video, that is just going to be a toy case.

And the reason why it works is basically mathematical rather than anything to do with a
physical analogy between the brain and the neural network.

Despite it being very popular in the media to talk about neural networks as if they were
simulations of brains and this is certainly a buzzword today, we think that this is not accurate.

Much like birds fly while aeroplanes do not actually try to replicate what birds do (you do not
have flapping-wing aircraft even today), you can nonetheless take some inspiration from
birds. What you are looking at is the basic principle.

Just like for bird flight or for aircraft flight, the basic principle is whatever lift the wing
generates, that has to be balanced by the weight of the body, whether it is an airplane or a
bird. Similarly we are trying to replicate some very very very rudimentary principles in
learning, which is what we will try to do.

So to come back to a biological neuron, I am certainly not a biologist so this image roughly
represents what happens in a biological neuron.

So this is the neuron cell

(Refer Slide Time: 02:51)

at the cell body. What happens is that whenever we get an input, whether it is through the eyes
(Doctor Ganapathy will be covering in detail next week how our visual system works) or
through your auditory system, when it finally goes to the brain you actually have multiple
sources of input.

(Refer Slide Time: 03:14)

Let us take the analogy of an eye. Just like in an image you have pixels, a lot of
information comes from various sources.

So we can see input nodes,

(Refer Slide Time: 03:28)

these are called dendrites, and all of these finally feed into one single neuron.

(Refer Slide Time: 03:39)

It is possible that, you know a single neuron can have as complex a (feature) function as
recognizing a specific face, say that of your mother or something of that sort.

So in that case you will get all these inputs from your mother's photo or your mother's
picture coming in from each input cell. You can think of this as analogous to the features that
we have

(Refer Slide Time: 04:05)

in our usual diagram.

And all these feed in into one place. Now when all of them feed into one place we can think
of that operation as if it is a summation.

(Refer Slide Time: 04:26)

Now what the neuron apparently does is, when all these come in, the electrical
signal either fires or it does not fire, or it fires somewhere in between. Again, I am not a
neuroscientist, so I am just going to give an approximate picture of what happens. So now all
these things come in and then, maybe below a certain threshold the neuron, so to speak, does
not wake up, and above a certain threshold it wakes up, or it fires.

(Refer Slide Time: 04:59)

You can think of this as if you know somebody is shaking you in order to wake you up and
below a certain threshold if they do it very, very lightly you will not wake up but if they
shake you heavily you will wake up. Similarly the neuron activates
(Refer Slide Time: 05:19)

only after a certain amount of electrical signal; all these inputs coming together activate it.

You will notice that this picture is somewhat similar to the sigmoid

(Refer Slide Time: 05:34)

that we had when we were dealing with logistic regression. So as it turns out, this simple
combination is what more or less defines a neural network.

This is our abstraction or cartoon picture of how a neuron works: a lot of inputs
come through the dendrites, they come into the neuron cell, they sum up there (which we also
did, whether we did linear regression or logistic regression: we summed it up), and then it either
activates or it does not, or it activates intermediately, which is what we did with the sigmoid.
Now after that, the output goes out through what is called an axon.

(Refer Slide Time: 06:19)

Now the output could be single or it could be multiple, depending on whether you have one
class or multiple classes when we are doing a neural network.

And what connects it in between in neuroscience terminology is what is called the Myelin
sheath.

(Refer Slide Time: 06:34)

What happens with the myelin sheath is that it gets thicker and thicker as you generate
memories which are stronger and stronger.

That is, if you repeat an action, whether it is writing, running, cycling etc the sheath tends to
thicken as you do the same activity again and again.

Now with all that abstracted out, what I want to point out is a simple abstract structure for this
whole process, which is what we will be using for the rest of the course: that of
multiple features as input

(Refer Slide Time: 07:07)

coming up together, summing

(Refer Slide Time: 07:12)

and then you have some kind of activation.


(Refer Slide Time: 07:18)

We will repeat this abstract picture in the next video where we will look at the idea of an
artificial neuron instead of a biological neuron. Thank you.

Machine Learning for Engineering and Science Applications
Professor Doctor Balaji Srinivasan
Department of Mechanical Engineering
Indian Institute of Technology Madras
Structure of an Artificial Neuron

(Refer Slide Time: 00:13)

In this video we will be looking at the structure of an artificial neuron.

(Refer Slide Time: 00:17)

So the artificial neuron is a simple abstraction of a biological neuron.

It is supposed to abstract out all the details that actually exist in the biological neuron and just
give you whatever is usable. Please do remember that this is a simplification, and this is not
how our brain neurons actually work.

In this video the topics that we will be covering are what goes into an artificial neuron and its
two operations. Really speaking, both of these operations are things that you had already
seen within logistic regression. We are just going to formally combine these operations into a
single thing called an artificial neuron.

(Refer Slide Time: 01:06)

So let us look at what an artificial neuron looks like in a neural network.

(Refer Slide Time: 01:11)

So suppose you have some inputs coming in from one end. Biologically, these are supposed
to be the equivalent of dendrites.

So you have

(Refer Slide Time: 01:25)

some 3 variables; in general, some vector $x$ which has these features $x_1, x_2, x_3$. All
these three come in, and as usual, as we did with both linear regression as well as classification,
we simply take a linear combination. So we take $\sum_i w_i x_i$, and this component we call $z$.

Remember the x that comes in is a vector

(Refer Slide Time: 02:02)

and what comes out of this linear combination, $z$, is a scalar. Till this point

(Refer Slide Time: 02:11)

what we have here is essentially a linear combination.

(Refer Slide Time: 02:22)

Now this z goes into the next part which is the nonlinearity. And what makes neural networks
work really is this nonlinearity.

(Refer Slide Time: 02:39)

One simple nonlinear function which we have already seen, for example, is the sigmoid
function. Graphically we denote it by the shape of the sigmoid curve.
(Refer Slide Time: 03:02)

So sigmoid of $z$ would be $\sigma(z) = \dfrac{1}{1 + e^{-z}}$, which is the same as $\dfrac{1}{1 + e^{-\sum_i w_i x_i}}$.

(Refer Slide Time: 03:17)

So we call this nonlinearity in general g. This might be the sigmoid. It could be other
functions which we will see later, for example tanh. We also have another thing called the
rectified linear unit or ReLU.

(Refer Slide Time: 03:35)

So any of these outputs, any of these nonlinearities could be used.

Now finally, after all this, what comes out is $g(z)$. This is denoted by $a$, also labeled the
activation.

(Refer Slide Time: 03:59)

If this is the only thing in your network, then your prediction is simply the activation of this
neuron, which is $g(z)$.

(Refer Slide Time: 04:12)

So remember: the two portions of an artificial neuron are a linear combination and an additional
nonlinearity.
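
As a small sketch of these two operations in Python/numpy (the input and weight values here are made up purely for illustration):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def neuron(x, w, g=sigmoid):
        z = np.dot(w, x)   # linear combination: z = sum_i w_i x_i, a scalar
        return g(z)        # nonlinearity: the activation a = g(z)

    x = np.array([1.0, 0.5, -0.5])   # features x1, x2, x3
    w = np.array([0.2, -0.4, 0.1])   # one weight per input
    a = neuron(x, w)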

(Refer Slide Time: 04:31)

Machine Learning for Engineering and Science Applications
Professor Doctor Balaji Srinivasan
Department of Mechanical Engineering
Indian Institute of Technology Madras
Feedforward Neural Network

(Refer Slide Time: 00:14)

In this video we look at a general version of the neural network called the feedforward
network. In the previous videos we had seen what an artificial neuron was.

(Refer Slide Time: 00:32)

A feedforward network, or a simple neural network (the term that you would have heard most
commonly), is basically a collection of neurons. Each of these units here is a neuron.

Now each of these neurons or each of these layers

(Refer Slide Time: 00:49)

which are vertically concatenated has a specific name. The very first layer is called the input
layer.

(Refer Slide Time: 01:01)

We had seen this during logistic regression and even in linear regression. So you will have
multiple features. This is the input vector.

(Refer Slide Time: 01:12)

The intermediate layers here are called hidden layers. You could have multiple

(Refer Slide Time: 01:25)

hidden layers. If the number of hidden layers is greater than 1, then it is called a deep
network. Hence the name deep learning.

(Refer Slide Time: 01:39)

So a deep network is simply a network with the number of hidden layers greater than 1.

The final layer where you actually get the output you are interested in is called

(Refer Slide Time: 01:52)

the output layer. So we have our predictions here, $\hat{y}_1$ (remember, the hat is
used for predictions), up till whatever is the number of

(Refer Slide Time: 02:05)

classes that we are predicting for.

Remember, $k$ in general need not be equal to $n$, and in general each layer might have a
different size. So these are the elements, and each of these elements here is an artificial
neuron, Ok.

Technically speaking we can

(Refer Slide Time: 02:24)

even treat the input layer as if it were made of neurons, but generally it is only after the input layer
that we look at each of these units and call them artificial neurons.

Remember that within each neuron we have 2 portions. We have

(Refer Slide Time: 02:42)

a linear combination, and we have a nonlinear activation function sitting there.

Now if I look at any neuron here, for example this neuron, it has inputs coming from all the
previous entities in the input layer, Ok. So this neuron here has n inputs; plus,
even though it is not explicitly shown here, you will have a bias unit which will be coming in
here.

So for this neuron you have n weights from the input layer. So let us

(Refer Slide Time: 03:32)

take a general neuron or a general set of neurons in some hidden layer. So let us say this is
layer l, Ok for example

(Refer Slide Time: 03:47)

this one would be layer 1. This one would be layer 2, hidden layer 1, hidden layer 2.

(Refer Slide Time: 03:53)

Let us say we have layer l where all these neurons are there. And you have a prior layer. This
is layer l−1.

(Refer Slide Time: 04:11)

Let us say further that this layer l−1 has n neurons and this layer l has m neurons.

(Refer Slide Time: 04:28)

So this means the total number of weights required, assuming

(Refer Slide Time: 04:43)

every neuron is connected to every neuron in the previous layer, would be nm plus the bias weights.

(Refer Slide Time: 05:13)

Now how many bias units do we have in such a case, Ok?

Now if I consider this neuron for example, this takes input from all these n,

(Refer Slide Time: 05:28)

plus one bias. Now if I take this one, it also takes all these n, plus a different bias.

(Refer Slide Time: 05:40)

So each neuron gets a different bias. So in this case if we look at these m neurons, you have
nm which are normal weights, the number of bias weights will be equal to m, because each of
these neurons has a different bias.

So the total number of weights in such a case is nm+m.

(Refer Slide Time: 06:06)

In a feedforward network, all you need to do is give all these $x_i$: if you give the $x$ vector and
all the weights in every layer, we can find out the $\hat{y}$ vector.

(Refer Slide Time: 06:34)

So this is called the feedforward process. Basically you are feeding in the x's, you also give all
the w's of every single layer here, and simply by taking a linear combination, nonlinearity;
linear combination, nonlinearity; linear combination, nonlinearity; you can predict all the
y's. This is called the feedforward process.
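
A minimal sketch of this feedforward process in Python/numpy, assuming sigmoid activations and the layer sizes drawn here (3 inputs, two hidden layers of 4, and 3 outputs); the random weights simply stand in for an initial guess:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def feedforward(x, weights, biases):
        # linear combination, nonlinearity; layer after layer
        a = x
        for W, b in zip(weights, biases):
            a = sigmoid(W @ a + b)   # W is m x n and b is m x 1: nm + m weights per layer
        return a                     # the final activation is y_hat

    sizes = [3, 4, 4, 3]
    weights = [np.random.randn(m, n) for n, m in zip(sizes[:-1], sizes[1:])]
    biases = [np.random.randn(m) for m in sizes[1:]]
    y_hat = feedforward(np.array([0.1, 0.2, 0.3]), weights, biases)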

(Refer Slide Time: 07:06)

Such a network, where all neurons are connected to every neuron in the adjacent layers, is
called a fully connected network.

(Refer Slide Time: 07:24)

In the general case you need not have all connections active.

In fact we will see later in convolutional neural networks that we have only some of these
weights which are non-zero and most of them are 0, which means each neuron is connected
only to a few other neurons in the previous layer.

So that would be the special case, but in the most general case you can think of a fully
connected network, sometimes simply called an FC network. The assumption behind the
feedforward process is that you know all the weights. And you know all the input neurons.

Later on we will see when we come to back propagation how to actually determine these
weights.

Machine Learning for Engineering and Science Applications
Professor Doctor Balaji Srinivasan
Department of Mechanical Engineering
Indian Institute of Technology Madras
Introduction to back prop

(Refer Slide Time: 00:13)

Welcome back. In the previous couple of videos you saw how a neural network does its
forward pass. Recall that when you have some neurons, let us draw a simple neural network
of that sort.

So let us say you have 3 features here, x 1, x 2, x 3

(Refer Slide Time: 00:33)

and some 4 hidden neurons here, another 4 here. And let us say we have

(Refer Slide Time: 00:42)

3 in the output layer, Ok.

So let us say we have a neural network which looks somewhat of this sort. Remember that

(Refer Slide Time: 00:53)

each node is connected to each subsequent node. Also notice that we typically do not show
the bias units, Ok.

(Refer Slide Time: 01:09)

The reason is very simple.

The reason we do not show a bias unit is there is nothing that goes from here to here. Because
this does not affect this unit which is always 1,

(Refer Slide Time: 01:18)

which is why we do not show it. So when you see a neural network diagram you typically see
only this portion, Ok.

(Refer Slide Time: 01:27)

Now notice that other than this, every node is connected to every other node. Ok so for
example this is connected to this, this is connected to this etc. I will not,

(Refer Slide Time: 01:38)

you know mess this up by drawing every single thing here.

So if I generally consider a node in one layer, let us call this a i, remember

(Refer Slide Time: 01:51)

a stands for activation. The activation comes after the two operations have happened, after the
summation and after the nonlinearity.

Ok so if I look at this a i which is sitting at

(Refer Slide Time: 02:03)

level l and a j which is sitting at

(Refer Slide Time: 02:08)

level l plus 1, the two are connected by a single line,

(Refer Slide Time: 02:15)

Ok. It is sort of like the Myelin sheath, anyway.

This we will denote by

(Refer Slide Time: 02:23)

$w_{ij}^{(l)}$. $w_{ij}$ by itself, that value, is a scalar. It is a single value. It tells you that $a_j$ at level l plus 1 gets some
contribution from $a_i$ at level l, and that contribution is multiplied by $w_{ij}$ of l. Ok so
this is the weight at the lth layer connecting the ith

(Refer Slide Time: 02:54)

neuron of level l with the jth neuron of level l plus 1.

So this notation is actually pretty straightforward. Remember this was the input layer,

(Refer Slide Time: 03:07)

these are the hidden layers

(Refer Slide Time: 03:10)

and this is the output layer.

(Refer Slide Time: 03:13)

Now depending on what notation scheme you choose, you can call this x 1, x 2, x 3. Or some
people, and I will also do so, choose to call it the a vector at level 1,

(Refer Slide Time: 03:29)

Ok.

So the a vector at level 1 is simply the input vector.

(Refer Slide Time: 03:35)

Similarly, this I could call the a vector at level 4, which is the

(Refer Slide Time: 03:41)

output layer, the y hat layer, Ok. So this then would be the a vector at level 2. And this would be the
a vector at level 3.

(Refer Slide Time: 03:53)

Remember, the a vector itself has 4 components: a 1, a 2, a 3, a 4 with the

(Refer Slide Time: 04:00)

superscript 2.

Similarly a 1, a 2, a 3, a 4 with the superscript 3. All this is so that I can kind of abstract this
out and show this as the x vector which somehow gives me the a vector at level 1

(Refer Slide Time: 04:21)

which somehow gives me the a vector at level 2,

(Refer Slide Time: 04:25)

sorry, based on our notation I should call this the a vector at level 2, then the a vector at level 3, which gives me
the a vector at level 4

(Refer Slide Time: 04:47)

which is the same as y hat.

(Refer Slide Time: 04:50)

Now at the end of this

(Refer Slide Time: 04:55)

you get your cost function J. Now remember, in all our procedures what we will be doing is
guessing all of the weights. As you can see

(Refer Slide Time: 05:16)

even in this simple diagram there are a lot of weights, Ok.

So you have 3 neurons here, 4 here. You have 12 but actually more than 12 because you have
your bias unit also. But let us just talk about the weights other than the bias. So there are 12,
16, another 12 and then add the bias units also. You have that many unknown weights. So
you have to initialize all of them by simply guessing.

Once you guess you get a cost function J. Now ideally you would want that cost function to
be 0. You know that that is not going to be the case because your guess is typically not going
to be so good.

So this J is a function of y, the ground truth, and your y hat.

(Refer Slide Time: 05:55)

And given that you are going to get the J, you have now to figure out which of these weights
was responsible for this higher J. What you want to do is essentially redistribute this J to all
these weights, Ok.

Remember

(Refer Slide Time: 06:19)

that some of the weights are just here, some of the weights go back here, some of the weights
go back here, Ok. So this procedure

(Refer Slide Time: 06:26)

of redistributing w using del J del w is of course called gradient descent, but just calculating
del J del w is called back prop or, more formally, back propagation, for
neural networks, Ok.

So whenever you hear

(Refer Slide Time: 07:02)

back prop, please remember this. All back prop is doing is calculating del J del w. So what is
the big deal in calculating del J del w? Why not do it in a simple way? So there is a simple
method. It is like this. You guess w, do a forward pass. What does the forward pass mean?

(Refer Slide Time: 07:32)

For a given x,

(Refer Slide Time: 07:34)

put those ws, calculate y hat. That is what the forward pass means. Remember this. That is
why we call it a feedforward neural network, Ok. So forward pass is simply going from x,
using some w

(Refer Slide Time: 07:48)

to y hat, Ok. Do a forward pass, get y hat. Calculate J of w.

(Refer Slide Time: 08:03)

Now make a slightly different guess; this is some perturbation. Again make a forward pass.
You will get some slightly different y hat. Calculate J of w plus delta w,

(Refer Slide Time: 08:30)

Ok.

Then del J del w is approximately equal to J of w plus delta w, minus J of w, divided by delta w.
Even though you cannot divide by

(Refer Slide Time: 08:46)

a vector, there are technical ways of doing it.

That is, if you want del J del w 1, you perturb with a delta in the first component only,

(Refer Slide Time: 08:59)

so on and so forth, as we saw in the partial derivatives example when we were doing
multivariable calculus. So this is called the finite difference method.
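
A sketch of this finite difference estimate in Python/numpy; here J is assumed to be any function that runs a forward pass with the weight vector w and returns the cost:

    import numpy as np

    def finite_difference_grad(J, w, eps=1e-5):
        # perturb one weight at a time: two cost evaluations per component of w
        grad = np.zeros_like(w)
        for i in range(w.size):
            dw = np.zeros_like(w)
            dw[i] = eps
            grad[i] = (J(w + dw) - J(w)) / eps
        return grad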

Now what is the problem here?

(Refer Slide Time: 09:17)

Why not use this always? The problem historically with neural networks was this: even
though it is simple in terms of coding (it is very simple to code), it is very
expensive.

(Refer Slide Time: 09:36)

Why is this expensive? It is expensive because for each gradient descent step that you have to
do, you have to calculate multiple of these del J del w's, Ok. For
each parameter you will have to calculate del J del w, and you could have millions of
parameters.

So these are millions and millions of calculations, and for each one you will have to do
2 evaluations, J of w and J of w plus delta w. So this turns out
to be extremely expensive. Until the 60s and 70s there was no easy way around this, which
is why a lot of people did not build large networks.

Till there came the algorithm for back prop,

(Refer Slide Time: 10:24)

called the back propagation algorithm. So sort of the founding father of neural networks,

(Refer Slide Time: 10:30)

Hinton was one of the people who wrote a classic paper on back propagation. This is an
application to neural networks of what is called automatic differentiation.

All these are fancy names. Basically

(Refer Slide Time: 10:50)

what we do is we use the Chain Rule. And it is

(Refer Slide Time: 10:54)

very similar to what I did in logistic regression, Ok. So we will be using the Chain Rule.

I am not going to do the full back propagation algorithm. This is not that kind of course.
Programming it is also very, very difficult. Kind of ironically, programming finite differences
is very easy but it is very expensive. Programming back propagation is very hard but it turns
out to be very cheap.

So we do all the theoretical as well as computational work up front in order to do cheaper
computations. Tensorflow and every single package that you will find,
including MATLAB etc., have back prop routines.

So what I am showing

(Refer Slide Time: 11:45)

in this video is just so that you can get a flavor, some intuition, of what is happening. And we
will be using only this portion of this intuition when we come to CNNs or RNNs, to explain
what problems we encounter while training neural networks, Ok.

So please remember, this is just sort of, we will give you some of the final expressions. Of
course if you have complicated network architectures, these might or might not work.

But for a simple, fully connected neural network of this sort (fully simply means that each
neuron is connected to every neuron in the next layer), called an FC

(Refer Slide Time: 12:29)

network, fully connected network, in such cases typically the expressions that I will give later
on will be true.

More importantly my derivation you can treat as a toy derivation because I will do a very
specific, very, very simple case. This is going to be a very simplified derivation of the rule.
This is just to give you a flavor

(Refer Slide Time: 12:57)

of how back prop works. And I will come to some technical details towards the end of this
week.

So let us start with back prop and we will take the same case as before. I had x vector. We
can treat this the same as a 1. This gave me a 2. I had 2 hidden layers. There is a 3. And then
there is a 4. a 4 is the same as

(Refer Slide Time: 13:32)

y hat. So please remember this picture.

Now the change that I am going to make, so as to make our derivations simple is to treat all
these as scalars. I am also

(Refer Slide Time: 13:46)

ignoring the bias term,

(Refer Slide Time: 13:56)

Ok.

So in the general case remember these were vectors. The weights were matrices. All that
complication is being thrown away by me just in order to get you a picture.

Now surprisingly enough, even with that the expressions we get are very, very close to the
final complex expressions, Ok. So after this

(Refer Slide Time: 14:18)

we get J and we want to find out what is del J del w.

(Refer Slide Time: 14:23)

Now how many weights do we have here?

Notice from a 1 to a 2 you have one weight. Here you have another weight. And here you
have another weight,

(Refer Slide Time: 14:34)

Ok. So we have 3 sets of weights, and in this case just 3 weights, because these are scalars.

So I am going to do the same thing as I did in logistic regression. I will just draw this figure
slightly differently for ease of comprehension, Ok. So first we have the linear
operator which gives us z, let us call it z 2. Go back here and we get a 2, Ok. So we have a
nonlinearity g.

(Refer Slide Time: 15:08)

Similarly you take a weight w 2, get z 3, go back here, you get a 3.

(Refer Slide Time: 15:20)

Similarly here, so let us put a g here also. I will change colors. You have the weight w 3, you
get z 4. Go back here, get a 4. a 4 is the same as y hat.

(Refer Slide Time: 15:40)

And I get the J here.

(Refer Slide Time: 15:44)

This is the nonlinearity g, Ok.

So let us write some expressions down just so that we can use them for clarity. a 1 is the same
as x. Then

(Refer Slide Time: 15:59)

if you notice here z 2 is equal to w 1 a 1.

(Refer Slide Time: 16:07)

z 3 is equal to w 2 a 2.

(Refer Slide Time: 16:13)

z 4 is equal to w 3 times a 3. This is my simplification of the

(Refer Slide Time: 16:23)

linear summation process that we have.

If we were dealing with the full vector case, all that would change here is that this would become
w 1 transpose times the a 1 vector, w 2 transpose times the a 2 vector, etc. Now apart from this, we have
the nonlinearities. We have a 2 as the nonlinearity applied on z 2.

Similarly

(Refer Slide Time: 16:50)

a 3 is g of z 3,

(Refer Slide Time: 16:55)

a 4 is g of z 4.

(Refer Slide Time: 17:01)

Finally y hat is simply a 4, Ok.

(Refer Slide Time: 17:10)

So these are our relationships. Finally what we want to find out is J: how much does J change
due to a change in w 1, Ok? You will notice what will happen: the moment w 1 is changed,
z 2 is changed, a 2 is changed, z 3 is changed, a 3 is changed, z 4 is changed, a 4 is changed;
so it is a cascading set of changes.

So if you have del J del w 1, then what is this? If I have del J del w 2 what is this? Similarly
del J del w 3 what is this?

(Refer Slide Time: 17:51)

So these are the questions that we need to answer. Now notice it is actually easiest to find out
this term. Why is that? Because this is closest to being responsible for J.

So let us find this term first, del J del w 3. If you have been very careful, you will actually
notice that this is very similar, in fact practically identical, to what we had

(Refer Slide Time: 18:19)

in logistic regression. You had an input, a summation, a nonlinearity, and you immediately got the
output; that is what we had in logistic regression.

So for now, for the sake of this example, I assume that J is the binary cross entropy function
(y ln y hat plus the corresponding 1 minus y term), because we had already done some calculations for this.

(Refer Slide Time: 18:43)

So let us assume that this is the binary cross entropy cost function. And that g is the sigmoid.

(Refer Slide Time: 18:52)

We will assume this but you can do this process for any g and any J, Ok as you will see, you
will see shortly. So let us say I want del J del w 3. What is it equal to? del J del a 4, this step
times del a 4 del z 4, that is this step multiplied by del z 4 del w 3, Ok.

(Refer Slide Time: 19:26)

Now if this is the binary cross entropy function, we had actually done this calculation. This is
the same as del J del z 4

(Refer Slide Time: 19:38)

and we have done this in the previous video. This is already equal to minus (y minus y hat).

(Refer Slide Time: 19:50)

So we have already calculated this before. Please look up that video to convince yourself that
this is exactly the same.

What is del z 4 del w 3? You can automatically see this. This is a 3. So let me write this
down.

(Refer Slide Time: 20:07)

del J del w 3 is equal to minus (y minus a 4) times a 3, where instead of y hat I write a 4. Now
this term, as we have seen before, is the error in the output activation.

(Refer Slide Time: 20:39)

Now we use a particular notation, Ok, we use the notation that del J del z 4 is defined as a
quantity called delta 4.

(Refer Slide Time: 20:56)

Notice this 4 and that 4 are not the same. Similarly we will say del J del z l is defined
as delta l. What does it denote?

(Refer Slide Time: 21:08)

It kind of denotes, Ok this is not exact but it denotes this term, error in activation.

So I want to warn you before several questions

(Refer Slide Time: 21:24)

come, that this is simply heuristic, just in order for you to build an
intuition about this thing. So what does this mean?

As you

(Refer Slide Time: 21:33)

perturb this, instead of what is supposed to be the actual a, you know in the final case when it
is very well fit, you are going to have something slightly different, Ok, just like this term, Ok
instead of a, you would have a plus something, Ok or a minus something.

And that is what this term delta 4 denotes. So we will keep that and we will write, but I have
not made any approximation here. All I have said is del J del w 3 is equal to delta 4 times a 3,
Ok.

(Refer Slide Time: 22:10)

So please notice this, Ok

Now suppose I am going to do del J del w 2.

(Refer Slide Time: 22:21)

Let us go back to the figure del J del w 2.

(Refer Slide Time: 22:25)

What will it be? You will do this, this, this, this, and finally this. So you are going to have 5
terms. So please track it with me. This is going to be del J del a 4, times del a 4 del z 4, times
del z 4 del a 3. Please notice this, del z 4 del a 3.

Then del a 3, let us go back to the figure, del a 3 del z 3. And finally del z 3 del w 2 as you
can see in the figure here, del z 3 del w 2.

So this term, this term, this term, this term and finally this term, Ok. So this looks very
tedious until we notice a certain pattern.

(Refer Slide Time: 23:38)

You will notice that all of this chain is simply del J del z 3.

(Refer Slide Time: 23:49)

This term is simple because z 3 is equal to, let us go back to our relationships, z 3 is equal to
w 2 a 2. Therefore del z 3 del w 2 is simply a 2, Ok

(Refer Slide Time: 24:05)


(Refer Slide Time: 24:11)

So if you notice this, del J del z 3 by our notation we had called this delta 3.

(Refer Slide Time: 24:18)

So you will get del J del w 2 equal to delta 3 a 2.

(Refer Slide Time: 24:28)

Notice these two formulae and you will automatically see a pattern.

(Refer Slide Time: 24:35)

You notice that del J del w l is equal to delta l plus 1 times a l.

This is our

(Refer Slide Time: 24:53)

first relationship, that del J del w l is delta l plus 1 a l. Ok so in this term this is what we want
to find out finally. Do we know everything? We know this. How do we know a l? Simply
from the forward pass. So if I made a guess for w, I would already have a l before finding out
y hat, Ok.

So a l is known from the forward pass. However delta l plus 1

(Refer Slide Time: 25:28)

is not known. It is known only in one specific case. Which case? This one. Because it was the
output error, Ok. So we know delta 4. But suppose if I asked you

(Refer Slide Time: 25:44)

what is delta 3, we do not know, Ok.

(Refer Slide Time: 25:51)

Similarly if I wrote del J del w 1 this would be delta 2 times a 1. a 1 is known

(Refer Slide Time: 26:00)

but delta 2 is not known.

(Refer Slide Time: 26:04)

So can we find out these two terms? We know the final one; you will always know the
final error. So I will call it delta L, where L is the number of levels.

(Refer Slide Time: 26:25)

So you will always know the delta at the final layer. One way or the other, you can always find
this out, as a combination of analytical and computational procedures, using the same idea
that we used here.

You just differentiate, using whatever nonlinearity you have there. So that can be found out.
So we need to find the other deltas,

(Refer Slide Time: 26:55)

Ok. So let us now write the relationships for these two. So notice this. Remember delta 4 was
defined as del J, let us go back up here, so del J, del J del z 4.

(Refer Slide Time: 27:24)

And delta 3 is del J del z 3, Ok.

(Refer Slide Time: 27:33)

So let us write the expressions for this. del J del z 4 was del J del a 4 multiplied by del a 4 del
z 4.

(Refer Slide Time: 27:48)

What about delta 3? This was del J del a 4, times del a 4 del z 4, times del z 4 del a 3, times
del a 3 del z 3. Just to refresh your memory, let us go back to the figure: del J del a 4, del a 4
del z 4, and now we want to go till z 3, so del z 4 del a 3, and del a 3 del z 3, Ok.

(Refer Slide Time: 28:28)

If you compare these two expressions, this and this, you will notice that this portion is already
delta 4, Ok. So that portion is straightforward.

What about this? Let us look

(Refer Slide Time: 28:47)

at this term, del z 4 del a 3. Come back to the figure, del z 4 del a 3 is simply this term w 3.
You can

(Refer Slide Time: 28:59)

see that. del z 4 del a 3 is w 3. So let us write that here. This term is w 3.

What about del a 3 del z 3?

(Refer Slide Time: 29:13)

del a 3 del z 3 is related through g; so notice, del a 3 del z 3 is simply g prime of z 3.

(Refer Slide Time: 29:33)

So let us write this, delta 3 is equal to delta 4 multiplied by w 3 multiplied by g prime z 3.

(Refer Slide Time: 29:49)

Turns out that, similarly, delta l is equal to delta l plus 1, times w l, times g prime of z l.

(Refer Slide Time: 30:13)

So if you combine this with the other expression we have, it was del J del w l, let us go back
to the expression here, is equal to delta l plus 1 a l.

(Refer Slide Time: 30:34)

These two combined give us the back propagation algorithm.

(Refer Slide Time: 30:41)

How is that? It is very simple.

In this network you start with the last layer, Ok which was the a 4 or the y hat layer. Calculate
delta 4 there.

(Refer Slide Time: 30:55)

In the example that we took, delta 4 was simply minus (y minus y hat),

(Refer Slide Time: 31:02)

Ok with the particular nonlinearity and the cost function that we took, Ok.

Once you have delta 4, using this expression you have delta 3, you have delta 2. And then
using this expression you simply have del J del w 1, del J del w 2 and del J del w 3. So every
single thing, so all gradients can be computed.
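
Putting the whole scalar derivation together as a sketch in Python/numpy, with the sigmoid nonlinearity and the binary cross entropy cost assumed as in the derivation (for the sigmoid, g prime of z is g(z) times (1 minus g(z))):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def backprop_scalar(x, y, w1, w2, w3):
        # forward pass: a1 -> z2 -> a2 -> z3 -> a3 -> z4 -> a4 = y_hat
        a1 = x
        z2 = w1 * a1; a2 = sigmoid(z2)
        z3 = w2 * a2; a3 = sigmoid(z3)
        z4 = w3 * a3; a4 = sigmoid(z4)
        # backward pass: start from the error at the last layer
        delta4 = -(y - a4)                      # del J del z 4 for this cost and g
        delta3 = delta4 * w3 * a3 * (1 - a3)    # delta l = delta l+1 * w l * g'(z l)
        delta2 = delta3 * w2 * a2 * (1 - a2)
        # del J del w l = delta l+1 * a l
        return delta2 * a1, delta3 * a2, delta4 * a3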

(Refer Slide Time: 31:33)

So notice that for one gradient descent step, unlike finite differences, you do not
have to go to each w, make a different perturbation, and calculate a different gradient
component. That is much too expensive. In fact we do that only in order to cross check
whether we have written the back propagation routine correctly or not.

Other than that, it is actually possible to do one single forward pass, calculate all the a's, and then
simply do one single back pass, Ok, or back prop step, and calculate all the gradients.

In fact these gradients are exact to machine precision. Why? Because we have not used any
approximation anywhere.

(Refer Slide Time: 32:20)

We have only used a simple Chain Rule, Ok; so this is simply the algorithmization of the Chain
Rule.

(Refer Slide Time: 32:28)

We have not used any approximation here. The only approximation which will come is due to
machine round off errors which we had seen earlier. This is the back prop algorithm in case
of a scalar expression. For the general vectorized expressions,

(Refer Slide Time: 32:56)

it turns out that the expressions are remarkably similar.

So if I take this case: you have the weight w i j at level l connecting a i at level l to a j at level l plus 1.

(Refer Slide Time: 33:19)

Then if you want del J del w i j at level l, this is equal to delta j at level l plus 1, multiplied by a i at level l. It is
actually remarkably similar to the previous formula that we have. It is very simple.

(Refer Slide Time: 33:46)

Error in activation of the next layer multiplied by the activation of the previous layer. This is
all there is.

Please compare it to our other expression which was del J del w l is equal to delta l plus 1
multiplied by a l. This is very, very similar. Ok this is the full scale expression

(Refer Slide Time: 34:06)

in case of a fully connected layer.

What about the other formula, for the deltas? Now you will have to deal with whole vectors. delta
at level l turns out to be equal to the weight matrix W at level l multiplied by delta at level l plus 1,
Hadamard-multiplied with g prime of z at level l.

(Refer Slide Time: 34:43)

Notice that this is a matrix and this is a vector. When you multiply the two, you will get a
vector. This is also a vector; the sizes work out due to the weight matrix. This is another
vector; z is also a vector now. And this, remember, is our Hadamard product which we had
seen in week 1, what is called element-wise multiplication.
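
A sketch of this vectorized step in Python/numpy, under the index convention used above (w i j at layer l connects neuron i at layer l to neuron j at layer l plus 1, so the matrix W at layer l has as many rows as neurons at layer l):

    import numpy as np

    def backprop_step(W_l, delta_next, gprime_z_l):
        # delta_l = (W_l @ delta_{l+1}) Hadamard-multiplied with g'(z_l)
        # matrix @ vector gives a vector; * is element-wise multiplication
        return (W_l @ delta_next) * gprime_z_l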

(Refer Slide Time: 35:24)

So just as a summary of the video, what we saw is: if you treat a neural network more or less
just like we have been treating either logistic regression or multinomial logistic
regression, or in fact even linear regression, you had some x, and somehow or the other, using some
guessed weights w, you are getting y hat, and you improve your w using del J del w.

(Refer Slide Time: 35:52)

The main computation in neural networks is calculating this del J del w for a given

(Refer Slide Time: 35:58)

guess w. That we do using back propagation. The idea is already shown here. The reason it is
called back propagation should be obvious: it is because we first calculate the error at the
last layer and then start propagating it backwards.

Ok, if this was my total error, how much was each node responsible for this total error? So
you take delta at the last layer, find out delta at the previous layer, then the previous layer,
and so forth; and then simply, del J del w is the delta at the next layer multiplied
by the activation of the previous layer.

So this, in this video I just showed you a quick scalar derivation of back propagation. In
general for complicated networks, you know, you could have networks with all sorts of skip
connections which instead of going from here to here would do this.

(Refer Slide Time: 36:51)

So what Tensorflow and other software like that do is to create what is known as a graph, that
is

(Refer Slide Time: 36:59)

they find out how each node is connected to every other node; this is in fact how you would
represent Tensorflow network diagrams, and based on automatic differentiation it does
back prop for you. You do not have to write it.

Currently nobody really has to write a back propagation routine. It is actually already available
in every single piece of software; this is just for you to build your intuition on what tends to
happen. As we will see in the next weeks, it is sometimes troublesome when you have very large
networks, because what is called the gradient does not flow back, Ok.

So if you have a very, very long sort of back propagation chain, errors multiply, and due
to machine epsilon problems you tend to not have proper changes in the gradient later on. So this
explanation was just to give you the intuition about what could possibly happen. Thank you.

Machine Learning for Engineering and Science Applications
Professor Doctor Balaji Srinivasan
Department of Mechanical Engineering
Indian Institute of Technology Madras
Summary of Week 5

(Refer Slide Time: 00:13)

In this video we will just summarize what all we have seen in week 5, and also what we did
not see in week 5, and give you a preview of what we will be doing next week.

So what we saw was when you have a linearly separable classification case

(Refer Slide Time: 00:37)

that is if you have data points which can simply be separated by a line such as this data set.

(Refer Slide Time: 00:48)

In such a case you could use logistic regression.

(Refer Slide Time: 00:58)

Logistic regression or binary logistic regression can be used when there are just 2 classes.
And the same idea we saw could be extended to k classes

(Refer Slide Time: 01:14)

using multinomial logistic regression.

In both these cases the major differences were simply in the forward model. The forward
model for logistic regression was sigmoid of w dot x. And

(Refer Slide Time: 01:35)

for multinomial logistic regression was Softmax of w dot x.

(Refer Slide Time: 01:42)

Now apart from this we also had our loss function, which was the binary cross entropy loss
function for logistic regression. And in the case of multinomial we saw that it was a simple
extension: it was the general cross entropy loss function there.

In both these cases it was fairly straightforward to calculate del J del w. It turned out to
give us the same expression as before, which was the summation from i equal to 1 to m of
(y minus y hat) times x.

(Refer Slide Time: 02:18)

Now this followed our general machine learning paradigm, which is: you take x, guess a w,
get a y hat, back propagate. This is what we did in the logistic as well as multinomial logistic

regression cases. Then, when we tried this for XOR, we saw that it needed an extra
layer in the middle.

(Refer Slide Time: 02:45)

It is not possible to simply take an input and map it directly to an output

(Refer Slide Time: 02:50)

without any hidden layer.

However with the extra layer it is possible, it can be proved that you have universal
approximation here which says

(Refer Slide Time: 03:04)

that any function can actually be approximated to an arbitrary degree of accuracy provided
you are willing to increase your number of neurons. It is possible to approximate any function
to an arbitrary degree of accuracy using one single hidden layer.

Neural networks however often use more than one hidden layer, and there is some disagreement
in the literature on this.

(Refer Slide Time: 03:33)

More than one hidden layer and this is what is called

(Refer Slide Time: 03:43)

deep learning. Deep learning simply means greater than one hidden layer.

(Refer Slide Time: 03:51)

That is typically what is called deep learning. There is some disagreement in the literature on
this: on how many layers you should take, or whether you should even just make do with one
hidden layer.

Some people are of the opinion that with certain tricks you can get by, but generally the
observation is you get fewer neurons and fewer weights the deeper that you go, Ok.

Now in order to train a deep neural network you however need the back propagation algorithm,
of which

(Refer Slide Time: 04:21)

we saw the rudiments in the previous video. Now one thing that tends to happen is, if you
recall, our expression was: delta l equals delta l plus 1 times w times g prime of z.

Now notice this term g prime z. When you have a sigmoid this g prime or the slope of the
sigmoid can actually get small,

(Refer Slide Time: 04:55)

further and further away from this central portion which has

(Refer Slide Time: 04:59)

high slope. The further and further away you get, the smaller this can get. And it can keep on
multiplying.

So you have delta 3 as some small number, let us say 0.1, multiplying delta 4.

(Refer Slide Time: 05:15)

delta 2 will be that small number multiplying delta 3, so on and so forth.

(Refer Slide Time: 05:21)

So if the small numbers keep on multiplying, it can actually get very, very small, it can go
below the machine epsilon, and then the network, as they say, will not train.
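
You can see this numerically with a toy sketch in Python (the slope of 0.1 per layer is just an illustrative number):

    delta = 1.0
    for layer in range(50):
        delta *= 0.1   # each layer multiplies in a small g'(z)
    print(delta)       # about 1e-50: negligible, given float64 machine
                       # epsilon of roughly 2.2e-16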

(Refer Slide Time: 05:39)

Similarly here too it will stop training. This is called saturation: your value is so close
to the flat ends of the curve that your slopes are very, very low. And this is the fundamental
problem in training deep networks.

You tend to get one of two

(Refer Slide Time: 06:00)

problems, which you will also see in the next few weeks: either exploding
gradients or vanishing gradients. That is, w actually completely blows up, of which we saw a
few examples even during linear regression; that was due to improper gradient descent.

Or you could have something which you think should train, but it does not train. And this is
where a lot of neural network research stagnated.

So there are tricks to deal with this, and Doctor Ganapathy will discuss several of them in the
context of convolutional neural networks next week. What is it that we
did not cover?

So one was this. The other things that we did not cover, and which we will be looking at in the
next weeks: how do we initialize w? As we saw

(Refer Slide Time: 06:54)

even for logistic regression or neural networks, the minimum is not unique. Since it is not
unique, how you initialize actually has an effect on how your neural network trains.

The second thing is: how do we determine the number of layers and the number of neurons per layer?
I just showed something arbitrary here. Both these are also hyper parameters. Remember,

(Refer Slide Time: 07:33)

in addition to alpha, which is your learning rate, and lambda, which is your regularization
parameter, the number of neurons, the number of layers, the neurons per layer: all these are also
treated as hyper parameters.

And hyper parameter optimization is a big problem. It is an open problem in neural networks.
Doctor Ganapathy will be discussing a few details about this later.

Finally what nonlinearity do you use? I showed just one. I showed sigmoid.

(Refer Slide Time: 08:10)

But there are other possible nonlinearities that people use. One is tanh, which is very

(Refer Slide Time: 08:19)

similar to the sigmoid; instead of going from 0 to 1, it goes from minus 1 to 1.

Another possibility is something called rectified linear unit. In short

(Refer Slide Time: 08:30)

it is called ReLU. It is completely flat at one end and then it is simply linear.
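
The three nonlinearities mentioned here, as a quick Python/numpy sketch:

    import numpy as np

    def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))   # output in (0, 1)
    def tanh(z):    return np.tanh(z)                  # output in (-1, 1)
    def relu(z):    return np.maximum(0.0, z)          # flat for z < 0, linear for z > 0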

Now different choices can be made for different problems. As a very, very simple rule of
thumb, for problems with numbers we tend to use artificial neural networks, and we tend to
use tanh instead of sigmoid.

For convolutional neural networks we tend to use ReLU, which you will see in the next week.
So these and other issues we will be seeing in the following week. And as a final heads up,
next week we will be moving to what are called convolutional neural networks, also called
CNNs.

They are a special case of ANNs, artificial neural networks, or the deep neural networks that
we just saw, specialized for vision problems.

(Refer Slide Time: 09:31)

Is there any problem with ANNs, such that we cannot use them for vision problems? No, not really.

The only issue is this. Let us take my favorite example, that of a 60 × 60 image. Then you have
3600 features, just from the pixels, and suppose you have 3600 neurons in the next

(Refer Slide Time: 09:54)

layer also. You can see that this is 3600 squared weights already, about 13 million, which is a
huge number of weights.

(Refer Slide Time: 10:02)

And vision problems deal with large images, so you will have very many features, which
means you have to deal with a huge number of weights. So instead of doing that, the trick is to
use what are known as convolutional neural networks. We will start seeing that from next
week. Thank you.

Machine Learning for Engineering and Science Application
Professor Ganapathy Krishnamurthi
Department of Engineering Design
Indian Institute of Technology Madras
Introduction to Convolution Neural Networks (CNN)

(Refer Slide Time: 00:14)

Welcome to the series of lectures on convolutional neural networks. CNNs are basically a
special class of the artificial neural networks you have seen before, one which expects
images as input. They are designed to work on images, mostly to handle computer vision
problems. Like the regular artificial neural networks that you have seen before, these
networks also have weights, neurons and bias units, and the weights in these CNNs are also
estimated by optimising an appropriate objective function. Because the inputs to these
networks are images, two things become possible: one is sparse connections (we will see what
those are as we progress), and the other is parameter sharing, that is, sharing of weights
between the output neurons in a layer.

(Refer Slide Time: 1:32)

CNNs in the recent past, and of course especially since deep learning has taken off, have found
applications basically in image recognition, object detection and localization, semantic
segmentation and medical image analysis. These are some of the areas where CNNs have
shown extremely good performance on many benchmark problems, and they are now being
tried out for many commercial applications.

(Refer Slide Time: 2:00)

So for instance, let us look at this particular application. This is an image from the wild, an
image showing a rural scene somewhere, and this image is given as input to the Google Cloud
Vision program (it is readily available and you can try it out). It then automatically
describes the image, saying it is a herd, that there is a 91 percent probability that it contains
goats, and that there is actually a herder in there. It also identifies grass, it also says livestock in
this case, and that the man in the picture is actually herding. So this level of detail, this
performance, is possible with current CNNs.

Here is an output from another CNN, from Silver Pond; it is an object detector, so it is able to
identify the man in the picture as well as the goat. However it does label it wrongly, as I think
a horse, okay. So it is able to localise as well as identify the objects, okay; this is a typical
application that you would do in computer vision, and it is not too hard to think of uses. What
everybody is now talking about is self-driving cars: you can see that if you have a vision
system in a self-driving car and it performs well, you can identify obstacles, or you can
identify a pedestrian, or signals or lights, zebra crossings et cetera, and act accordingly. So
this is a typical application in computer vision.

(Refer Slide Time: 3:44)

CNNs also find applications in image analysis, and in this case we are looking specifically at
medical image analysis. If you look at this image here on the left, it is an MR image of the
brain; it has been pre-processed to some extent, so you can see some abnormalities here and
there, okay. And what you see on the right here is basically the pixel labelling task done by a
CNN, where it correctly, or fairly correctly, identifies regions that appear abnormal.

The various color schemes here correspond to different types of classes within the
abnormality itself. There is a huge advantage, at least in medical image analysis: this is just
one slice of a brain, and in a typical medical image there are hundreds of such slices going
through a particular anatomy. Going through them manually and labelling each of these
regions by hand is a very tedious and very error prone task, so this can actually serve as a
huge support for a radiologist who looks at these kinds of images every day to interpret them
and diagnose patients.

(Refer Slide Time: 5:15)

So what we have here are representative images from the ImageNet database. Now this
ImageNet challenge, a visual recognition challenge, has been going on for quite a few years
now. Basically the challenge organisers make available to you millions of images, drawn
from the wild from the internet and labelled by experts as belonging to one of a thousand
categories, and the challenge is to create a visual machine learning or AI system that, when
given an input image from a test set (again containing several hundreds of thousands or
millions of images), is able to correctly classify it, okay.

So shown here are several images. This is actually an image of a sidewinder, I have already
marked it with red, and this one is actually marked as a hatchet, that is the correct label; the
blue shows the correct label. And this is again a schipperke, I actually do not know what that
is, but anyway, these are some of the predictions made by a typical network trained on the
database. The challenge is that the correct class should be among the top 5 predictions of
your system, okay. So over the last few years these CNNs have proven to have outperformed
many other AI systems trained for this task. The human error rate itself is around 5 to 6
percent, and there are now large, deep CNNs which outperform this.

(Refer Slide Time: 6:58)

So for instance, look at the accuracy of some of these; these are the names given to different
convolutional neural networks by the authors who built them or the people who worked on
them. Now the top 5 accuracy is pretty impressive: it is almost about 95 to 96 percent,
which is approaching human top 5 accuracy, okay. The parameters of the network are
basically the number of weights in the network (the depth we will see later what it means).
They range from a million to 140 million, okay. So people started off with a very large
number of weights, and over time the networks have trimmed down: they have gone deeper,
but they have managed to reduce the number of parameters in the network and also improve
the top 5 accuracy.

(Refer Slide Time: 7:51)

Okay, so before we move on to what CNNs do and how to build your own CNN, let us look at
how images are parameterised, okay. We saw earlier that CNNs take images as input; so what
does that mean? You know that for an artificial neural network the input is usually a vector, a
vector of values, or labels or some categorical label if you want. But as far as a CNN, a
convolutional neural network, is concerned, the image is the input. There are different types
of images, grayscale as well as RGB, so what do we mean by giving images as input?

So take a grayscale image; here is an image of a digit, the digit 8. You can think of the image
as being made up of pixels: this is actually a 2-D matrix, and the image is made up of pixels
with each pixel having a particular numerical value. So this whole image is actually a 2-D
matrix, and you can think of it as a grid (I can draw some crude grid-like structures here; it is
a very coarse view of the image), so that at each grid point there is a numerical value
associated with it.

So that is the pixel value. In a typical image, for instance the images that you take with
your camera, the values of the pixels range from 0 to 255; these are referred to as 8-bit
images. The dimensionality of the input is basically the size of your 2-D matrix: you might
have nx pixels on the x-axis and ny pixels on the y-axis, so your image input is of size
nx × ny; this is your input. If you want to think of it in terms of a regular artificial neural
network, you have an input vector of size nx × ny.

The example given, the MNIST database, has images of digits which are of size 28 × 28 pixels,
so that means it is a vector of size 784. So as far as grayscale images are concerned, you can
think of them as a 2-D matrix of dimensions nx × ny, depending upon how many pixels there
are along the x or y axis, and if you think in terms of a regular ANN, the size of the input is
basically the total number of pixels in the image.
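
As a quick sketch of this bookkeeping in Python/numpy (the zero array just stands in for a real image):

    import numpy as np

    gray = np.zeros((28, 28))      # an MNIST-sized grayscale image: nx x ny
    print(gray.reshape(-1).shape)  # (784,): rasterized into an ANN input vector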

(Refer Slide Time: 10:59)

Now by images we also mean RGB images; by RGB we basically mean typical color images.
If you take an RGB image, we can extract the individual channels in it. So an RGB image
consists of 3 channels: R, G and B. Each channel itself is an image, and the pixel values in
each of the channels range from 0 to 255. So what we see and visualize as an RGB image is
basically the combination of these RGB pixel values.

So if you want to give a CNN an RGB image as input, you have nx × ny pixels for every channel, but there are also 3 channels, as they are called; this is the terminology typically used in CNNs. So the size of your input is basically nx × ny × 3. Once again, as before, if you want to think in terms of regular ANNs, you have to rasterize the image, making it into a vector of size nx × ny × 3. To generalise, the CNN basically takes a volume as input: by volume I mean that you have a pixel array of a given size nx × ny, but there can be multiple such arrays, which gives rise to a volume. The CNN takes that volume as input and assigns it to a particular class label based on your objective function.
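Again a minimal sketch of the shapes involved, with a random array standing in for a colour photograph:

```python
import numpy as np

# Stand-in for an nx x ny colour image with 3 channels (R, G, B).
rgb = np.random.randint(0, 256, size=(224, 224, 3), dtype=np.uint8)

# The CNN sees the whole volume; a plain ANN would need it rasterized.
print(rgb.shape, rgb.reshape(-1).shape)  # (224, 224, 3) (150528,)
```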

(Refer Slide Time: 12:49)

Since we can rasterize these images, reading out the pixel values one at a time and forming a vector, why not just go ahead and use a regular artificial neural network? If these images are small, let us say a 32 × 32 image, or a 30 × 30 image for the sake of calculation, then with 3 channels you have 2700 input neurons for a regular ANN. However, most regular-sized images are of the order of 224 × 224 or 256 × 256, and with 3 channels that is already of the order of 10^4 to 10^5 neurons, okay.

Which means that if we want a hidden layer of 1000 neurons, that gives rise to about 10^8 weights, and this is a very conservative estimate, because sometimes the images are as large as 512 × 512 or 1000 × 1000; medical images especially are quite large. So as the size of the input image increases, ANNs do not scale well: there is an explosion in the number of weights that have to be estimated, which means that a proportionally large number of data points is required.
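Here is that back-of-the-envelope arithmetic; the hidden-layer width of 1000 is the assumption from the lecture.

```python
# Weight count for the first fully connected layer of a plain ANN.
small_input = 30 * 30 * 3          # 2,700 inputs for a tiny RGB image
large_input = 256 * 256 * 3        # ~2 x 10^5 inputs for a regular-sized RGB image
hidden = 1000                      # assumed hidden-layer width
print(small_input * hidden)        # 2,700,000 weights
print(large_input * hidden)        # 196,608,000 weights, ~2 x 10^8
```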

Another reason why you should not use an ANN is that since an ANN takes a vector as input, even if you are given an image you have to vectorize it by rasterizing it, and in that process you lose the spatial structure of the data. Images have spatial structure, which is what we want to exploit by using a CNN, and in the process, as we saw earlier, we also exploit two more things. We get a sparsely connected network, which means there will not be as many weights as in an artificial neural network. In addition there is parameter sharing, which again reduces the number of weights and in turn enables us to exploit the local connectivity in the image.

(Refer Slide Time: 15:28)

So what does a CNN consist of? CNNs, like ANNs, consist of a sequence of hidden layers, but these hidden layers are basically convolution or pooling layers; we will see what these are in later slides. It is an alternation of convolution and pooling layers, followed by a series of fully connected layers, just like in artificial neural networks, leading to a classification layer. This is the typical structure of a convolutional neural network.

Now why do we have this kind of structure? What is this convolution doing here? What does it mean, and what does convolution accomplish? We will examine that, but before we go there, just to summarise: CNNs take images as input, as we saw earlier, but these images can have multiple channels, a simple example being an RGB image, which has 3 channels. So they take a volume as input, and each layer in a CNN outputs a volume, irrespective of whether it is a convolution or a pooling layer. This is in contrast to ANNs, where the output of every layer is basically another vector of neurons.

(Refer Slide Time: 17:05)

Just to summarise again in visual form: in this case the input to the convolution layer is an RGB image (we will not look at what the operations are right now), and the output from the layer is basically a volume. If you slice the volume this way, you have multiple 2-D outputs, each of which is a 2-D map. Each of these 2-D outputs is often referred to as a feature map or an activation map. The number of output feature maps, or activation maps, is entirely within our control, and we will see how that can be defined; the size of each 2-D map, its nx and ny extent, is determined by the operations we perform.

Similarly for a single input channel: this example is for 3 input channels, but irrespective of the number of channels in your input, you can have multiple channels in your output. This is true of every layer; for instance, this layer can again undergo another convolution, leading to an even higher number of activation maps being output. We will see that in some popular CNN architectures later on.

(Refer Slide Time: 18:53)

So why convolutions? What do these convolutions accomplish, and what are they inspired by? We will look at what exactly convolutions do, but before that: in the 1960s, Hubel and his colleagues did a series of experiments measuring the activations, or signals, from neurons in the primary visual cortex of cats. We will not go into exactly how they did the experiments, but these were biologists; they knew what they were doing.

They eventually won a Nobel Prize for this work. The primary visual cortex, we will see, takes its input from the retina. The retina is where what we see is projected: the lens of our eye projects whatever we see onto the retina, and the signal from the retina goes to the primary visual cortex. They found that the primary visual cortex has 2 types of neuron cells: simple cells and complex cells.

The simple cells, of course, get their signals from the rods and cones in the retina, and the simple cells respond to edges: they have a very strong response to edges of different orientations, and it is a linear response. Then there are other types of cells, called complex cells, which seem to take their input from the simple cells, a linear combination of inputs from the simple cells, and have a nonlinear response. One aspect of this was that it was insensitive to translations: you can project an edge onto the retina of the eye and move it across the retina, but the output from a complex cell would be the same; that is what it means. So they concluded that the visual cortex has these 2 types of cells, and they characterised this behaviour, okay.

(Refer Slide Time: 20:41)

So this was the inspiration. If you want to picture it: this, let us say, is your eyeball, and this is your retina right here. A simple cell takes as input the signals from a particular small region; this is called the receptive field of that particular cell, or group of cells. What it does is take a linear combination of the signals from its receptive field and provide an output.

Similarly, there are a bunch of these simple cells, each with its own receptive field in the retina of the eye, and they in turn provide outputs; the complex cell takes a weighted combination of these outputs and provides a nonlinear output. This is the inspiration behind CNNs: they try to mimic this vision system. That is the general wisdom that goes around, but it is not a complete understanding; there has been a lot of progress in the field of how the vision system works, and this is, like I said, a very simplified, cartoon version of how the vision system works.

(Refer Slide Time: 22:22)

Another way of looking at why we use convolutions: if you go back to signal processing or conventional image processing techniques, it is well known that given an image we can define filters, or filter kernels as they are called, which, when applied to the image, extract different features from it. In this case I have shown a very simple kernel: one row and two columns, a 1 × 2 filter kernel. All you have to do is superimpose it on the image (here it is 1 and −1), multiply with the underlying pixel values, and add them.

You can see that by applying this filter kernel, and of course translating it and doing the same thing everywhere across the image, you get the edge map. Now it is possible to construct by hand different filters like these which will highlight edges at different orientations. There is a multitude of such filters, for instance Sobel filters or Prewitt filters and so on, and it is well known in traditional image processing literature that these filters can be used to highlight edges in an image, which is very similar to how we saw the simple cells work. The receptive field of this kernel is just 1 × 2, okay.
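A minimal sketch of that 1 × 2 kernel in NumPy; the tiny two-tone image is made up for illustration.

```python
import numpy as np

# A toy image: dark left half, bright right half, so there is one vertical edge.
img = np.zeros((5, 5))
img[:, 2:] = 255.0

# The 1 x 2 kernel [1, -1]: slide it along each row, take the sum of products.
kernel = np.array([1.0, -1.0])
edge = np.zeros((5, 4))                     # output width is ny - 2 + 1 = 4
for i in range(5):
    for j in range(4):
        edge[i, j] = np.sum(img[i, j:j+2] * kernel)

print(edge)  # nonzero (here -255) only in the column containing the edge
```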

We can make a 3 × 3 Sobel filter, and the receptive field of that filter is a 3 × 3 region in the image: if you define a 3 × 3 filter, that 3 × 3 region is the receptive field of the filter. The idea is that if we define enough filters, we get a variety of edge maps; we can call this the first convolution layer, and then we do more combinations of those edge maps to get maybe a higher-order description of the image. What these filter kernels exploit is that edges in an image are similar everywhere. For instance, suppose you have a vertical edge somewhere in one region of the image; let us say there is an edge here.

I can have a similar edge here too; the orientations may differ, but an edge here can be very similar to an edge there: this edge here and that edge there are essentially the same. So if I define a filter that picks out an edge at a particular angle, say a 45-degree edge, I can use it all over the picture to highlight that particular edge. This is what parameter sharing is all about: to identify a particular feature, I do not need to define a new filter for every region in the image. The same filter can be applied wherever there is an edge at that particular angle, and the filter will pick it up. This is another way of looking at convolutional neural networks.

(Refer Slide Time: 26:01)

So what does convolution accomplish? As we saw, convolutional neural networks have the same structure as artificial neural networks: there is an input layer, followed by a sequence of hidden layers, and then an output. Each neuron in the output of a layer is connected to a small neighbourhood of its input (in general, you can think of the input as also being a layer in the network). That is what the convolution kernel, the filter we saw on the last slide, accomplishes, and the connection is through a weight matrix which we call the filter or kernel, okay.

For every convolution layer we can define multiple filter kernels, and the way it works is that we move each filter kernel around the image: at every position we multiply it with the underlying pixel values and add them up, a sum of products, which gives rise to a corresponding output. Since we can define multiple filters in every layer, we can stack the outputs obtained by applying each of the filters to the input, giving rise to another volume of hidden neurons.

(Refer Slide Time: 27:31)

So let us just look at how a typical convolution works. What we see on the left here is a toy image, you can call it; it is a 5 × 5 image, and on the right is your 3 × 3 convolution kernel. How does one actually perform the convolution? That is what we are going to see.

It is very simple: all you have to do is superimpose the convolution kernel starting at some point (you can start from anywhere, but typically the top-left corner of the image), multiply, and add. So it is a sum of products of the corresponding elements, meaning the corresponding elements of the image with the filter weights. Here it is
[1 × 1 + 0 × 1 + 2 × 0 + 0 × 2 + 2 × 1 + 1 × 1 + 7 × 1 + 0 × 0 + 1 × 2 = 13] and so on and so forth.

That gives rise to one element of your output feature map. The next step is to slide the kernel to the right by 1 pixel and perform the same operation, and we keep doing that. Once we hit the edge of the image (if we go any further, the filter will not fit completely inside the image), we stop there, move 1 pixel down, and continue. At every point we place the kernel, we multiply with the underlying pixel values, add, and obtain the corresponding output there.

As we move through, we see that at every position we perform the same operation, and once again, when we come to the bottom, if we shift the filter down any further it will not fit inside the image, so we stop right there. In general, if you have an image or input of size nx × ny and your filter kernel is of size fx × fy, your output size will be (nx − fx + 1) × (ny − fy + 1). So you see that as we do convolutions the output size keeps decreasing; there is a way to stop that, and we will see how that is done in a more systematic way, but this is typically what happens when you do convolutions.
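A minimal sketch of this sliding-window operation (strictly speaking a cross-correlation, which is what CNN layers compute); the function name and toy arrays are mine.

```python
import numpy as np

def conv2d_valid(img, kernel):
    """Slide the kernel over the image; each output is a sum of products."""
    nx, ny = img.shape
    fx, fy = kernel.shape
    out = np.zeros((nx - fx + 1, ny - fy + 1))   # output shrinks by f - 1 per axis
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i+fx, j:j+fy] * kernel)
    return out

img = np.arange(25, dtype=float).reshape(5, 5)   # toy 5 x 5 image
k = np.ones((3, 3))                              # toy 3 x 3 kernel
print(conv2d_valid(img, k).shape)                # (3, 3)
```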

At every convolution layer you will define a multitude of these filters. When I say "define", you do not actually choose the values, because these are the weights, the equivalent of the weights in your artificial neural network. Just as we typically estimate the weights of an ANN by optimising an objective function, in this case we will determine the elements of the filter kernels by optimising a suitable objective function.

At every layer, given an input, we can define many such filter kernels: say K filters, where K can be a very large number; for instance, there are networks which define 512 such filters in a layer. The output would then be K feature maps, each of a particular size. Typically, within a layer all the feature maps are of the same size, even though there are exceptions which we will look at in later videos. For instance, one layer might define 3 × 3 filters, about 256 of them: all filters would be of size 3 × 3, so all the output feature maps would be of the same size, and there would be K of them.

(Refer Slide Time: 31:46)

Now there is another operation, namely pooling: following a convolution, or a series of convolutions, there is also pooling. What does pooling accomplish? One of the major advantages it supposedly gives is translational invariance, and it is very easy to visualise. Suppose you have an object in your picture and you are trying to localise it using the neural network. Now if you keep subsampling the picture, let us say you subsample from 256 × 256 down to 32 × 32, that is exactly a factor of 8 reduction.

So if the object moves around inside the 256 × 256 image by less than 8 or 16 pixels, you hardly see any motion in the 32 × 32 image. What pooling performs is basically a subsampling operation: it reduces the size of your feature maps, and as you build more and more layers, it comes to a point where the network becomes invariant to quite large motions of the object you are trying to detect in the original image. Typically, average pooling and max pooling are commonly used.

(Refer Slide Time: 33:21)

We will look at max pooling; average pooling should then be obvious. Take this feature map of size 4 × 4. A typical max pooling operation again uses a kernel, so say we define a 2 × 2 kernel here, and what max pooling does is take the maximum value inside this 2 × 2 window; here it is 6. And unlike convolution, where we saw that you slide the filter kernel by one pixel, here you slide it so that there is no overlap.

So you slide so that there is no overlap (you can also skip pixels; we will see that that is the stride). In this window 8 is the maximum; similarly 3 in this window, and 4 is the maximum in this one. So if you do max pooling of size 2 × 2 with a stride of 2 (the stride being how many pixels you move before you do the next pooling operation, in this case 2 pixels), then you halve the size of the feature map.

This is the function of max pooling. If you want to be very systematic about it, you would try to retain the size of your feature map when doing convolution (we will see how that can be done) and let any subsampling be done during the pooling operation. For average pooling, instead of the maximum value you take the mean of the values inside the 2 × 2 neighbourhood.
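A sketch of 2 × 2 max pooling with stride 2; the toy feature map is chosen so that the window maxima come out to 6, 8, 3 and 4, matching the values read off the slide.

```python
import numpy as np

def max_pool_2x2(fmap):
    """Non-overlapping 2 x 2 windows, keep the maximum of each: halves each side."""
    nx, ny = fmap.shape
    out = np.zeros((nx // 2, ny // 2))
    for i in range(0, nx, 2):
        for j in range(0, ny, 2):
            out[i // 2, j // 2] = fmap[i:i+2, j:j+2].max()
    return out

fmap = np.array([[1., 6., 2., 3.],
                 [4., 5., 8., 1.],
                 [0., 3., 1., 4.],
                 [2., 1., 0., 2.]])
print(max_pool_2x2(fmap))   # [[6. 8.] [3. 4.]]
```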

(Refer Slide Time: 35:17)

We mentioned earlier that in a convolutional neural network, volumetric convolutions are done. So what do we mean by that? Whatever we have seen so far, we have been defining the filter kernels as 2-dimensional matrices, for instance a 3 × 3 filter kernel, i.e. fx × fy. So what do we mean by volumetric convolution, and how does it work? Let us take this particular example and just look at the input layer (it generalises to all layers). The input is an RGB image, so we have 3 channels, and our input size is basically nx × ny × 3, okay.

The 3 corresponds to the R, G and B channels, and each of them is an image. Now when you perform a convolution, let us say we define a 3 × 3 convolution (it need not be 3 × 3; if you are uncomfortable with 3 × 3 you can even use a 5 × 5 kernel, which is easier to draw), so this is your filter: say you are doing a 3 × 3 convolution on your input image which has 3 channels. Even though we say 3 × 3, the filter itself is defined as 3 × 3 × 3, so there is a filter which operates on the red, the green, and the blue.

Each channel is operated on by a 3 × 3 kernel concurrently, and a neuron in one of the output feature maps is the sum of all the outputs. We saw earlier, going back to those slides, that we superimpose the filter kernel on a region at a location in the image and multiply with the underlying pixel values to get one output; that was for a 2-D input and 1 filter kernel. Now we have, in this case, 3 channels, so the filter kernel itself is 3 × 3 × 3. If we had 5 channels as input, the filter kernel would be 3 × 3 × 5: here there are 27 values, there it would be 45 values, plus a bias unit if you want to include the bias, okay.

This is how volumetric convolutions are done: the filter acts across the channels, or across the volume. So in this case, if we have defined K filters of a particular size, the layer gives you K feature maps. Now if I do a 3 × 3 convolution on that output, I am actually using a 3 × 3 × K filter to operate on those features: it is 3 × 3 spatially, but it acts across the K feature maps. This is what we refer to as volumetric convolution; the number of input channels can be variable.

So typically you will only specify 3 × 3 or 5 × 5 filters, but it is implicit that those filters also act across the channels, depending on how many channels you have as input, and all the values in the filter kernel are unique. Do not think that if you define a 3 × 3 kernel (or, for simplicity, a 2 × 2 kernel, something of that sort), it is duplicated across the channels; it is not. If we have 3 input channels, the kernel is 2 × 2 × 3, that is the size of your filter kernel, and it has that many unique elements, all determined by your backpropagation algorithm: the weights are estimated that way. This is an important point to understand, because many beginners really falter here.
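A sketch of one such volumetric filter acting across channels; the arrays and bias value are made up, and the loop only illustrates the sum of products.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((3, 5, 5))      # input volume: 3 channels of 5 x 5
w = rng.random((3, 3, 3))      # ONE filter: 3 channels x 3 x 3 = 27 unique weights
b = 0.1                        # one bias per filter

out = np.zeros((3, 3))         # spatial output: (5 - 3 + 1) x (5 - 3 + 1)
for i in range(3):
    for j in range(3):
        # The sum runs over all 3 channels at once: the filter acts across the volume.
        out[i, j] = np.sum(x[:, i:i+3, j:j+3] * w) + b

print(w.size, out.shape)       # 27 (3, 3)
```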

(Refer Slide Time: 40:15)

So how do we determine the size of the output volume? We have seen some hints so far; now let us do it systematically. The size of the output volume, or feature map, depends on the size of the input (we saw that), the size of the filter kernel we are using, how much zero padding we do (we will see why we use zero padding), and the stride, okay.

(Refer Slide Time: 40:43)

So, padded convolution: why would you want to do padded convolution? We saw earlier that when we do a convolution with a 3 × 3 kernel, the size of the output is less than the size of the input. Suppose we want to preserve the size. The reason is that if you have multiple convolution layers, then as you do more and more convolutions, at some point the feature maps become so small that you cannot do any more filtering, so it actually places restrictions on how deep you can go, okay.

In order to avoid that, we sometimes do padded convolutions. Here is our input feature map; again we will operate with one feature map at a time, and it automatically applies to an input volume, which is just difficult to show on screen this way. We have an input image of size 3 × 3, and it is very simple: if you apply this 3 × 3 kernel on this 3 × 3 image, your output will be 1 × 1; that is typically what we get, and then you cannot do subsequent filtering on top of that, it becomes difficult.

(Refer Slide Time: 42:04)

To preserve the size of a feature map, you pad with zeros all around. This corresponds to a padding of 1, which means you pad with one row or column of zeros on the left, right, top and bottom of the picture. If you do a convolution with this, it is very similar to what we did earlier: you position your filter kernel at the top left and move it around as in a regular convolution, except that now you have added zeros everywhere on the edges, and by doing this we get an output feature map of size 3 × 3, okay.

So this is the idea behind padded convolution: it helps to preserve the size of your input. Of course you can add more zeros; you will just get a slightly larger feature map, but typically it is done to preserve the size of your feature map, and the zero padding is determined by the size of your filter kernel. In general we will just use square feature maps as input, so if the size of your feature map along one axis is N, we saw that if we use a filter kernel of size F, the size of the output is N − F + 1, okay.

If we have a padding of size p, in this case p = 1, then, writing it more clearly, the output is N − F + 1 + 2p. In this case our original N = 3, F = 3, p = 1, so N − F + 1 + 2p = 3. If you want to preserve the size of the feature map following the convolution, you want 2p = F − 1, i.e. p = (F − 1)/2, which is another reason why we would like F to be odd: otherwise you get fractional values and have to round up or down, and several adjustments have to be made downstream. You can work through that, but typically we will work with odd-sized filter kernels, 3 × 3, 5 × 5, 7 × 7 and so on, so that these calculations are simplified.
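A quick sketch of this "same" padding with NumPy's np.pad; the arrays are toy values.

```python
import numpy as np

# 'Same' padding: p = (F - 1) / 2 = 1 ring of zeros around a 3 x 3 input,
# so a 3 x 3 kernel gives back a 3 x 3 output (N - F + 1 + 2p = 3).
x = np.arange(9, dtype=float).reshape(3, 3)
xp = np.pad(x, pad_width=1)            # zeros on left, right, top and bottom
k = np.ones((3, 3))

out = np.zeros((3, 3))
for i in range(3):
    for j in range(3):
        out[i, j] = np.sum(xp[i:i+3, j:j+3] * k)

print(xp.shape, out.shape)             # (5, 5) (3, 3)
```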

(Refer Slide Time: 44:57)

Another operation that we typically do is strided convolution. Again, we saw that with convolution the size of the feature map reduces; however, it reduces gradually if you shift the kernel by 1 pixel every time, and a stride of 1 is typically what we do. However, if you have, let us say, very large feature maps and you want to subsample them quickly (otherwise memory becomes an issue), then you do strided convolution. Strided convolutions are very simple to understand: instead of moving the kernel over the image one pixel at a time, you skip multiple pixels at a time, and however many pixels you skip is basically the stride you are using.

(Refer Slide Time: 45:47)

Stride 1 is what we saw earlier and is typically how we would go, but take a stride of, say, 2: we start from the top left as usual, and then we skip, so with a stride of 2 the kernel moves 2 pixels, and you get the corresponding output there. As you go down you again skip 2 pixels and position the kernel there, and of course you do one more skip to get to the last position. So where you would otherwise have obtained a 3 × 3 output, because of the stride you get a 2 × 2, okay. The general formula for strided convolution, given a square feature map of size N × N, is:

output size = (N − F + 2p)/s + 1

so in this case it is very easy to verify:

N = 5, F = 3, p = 0, s = 2

⇒ output size = (5 − 3 + 0)/2 + 1 = 2
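The same bookkeeping as a small helper function; the function name and defaults are mine.

```python
def conv_output_size(N, F, p=0, s=1):
    """Output side length of a square convolution: (N - F + 2p) / s + 1."""
    return (N - F + 2 * p) // s + 1

print(conv_output_size(5, 3))            # 3: plain (valid) convolution
print(conv_output_size(3, 3, p=1))       # 3: 'same' padded convolution
print(conv_output_size(5, 3, p=0, s=2))  # 2: the strided example above
```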

You get a 2 × 2 feature map, so you subsample quite quickly. You typically work with odd-sized filter kernels (not an odd number of filter kernels, but kernels whose side length is an odd number), because it makes these calculations convenient: you have to make sure that at no point do you run into fractional values, which would make it difficult to resize your feature maps.

So we typically work with these odd-sized filters, so that these calculations end up with whole numbers rather than fractions. To summarise briefly what we have looked at so far: we have looked at convolutional neural networks, and we have seen that they are made up of a sequence of convolution and max pooling layers, leading to a fully connected layer and a decision layer.

Convolutions are basically done by defining filter kernels at every layer in your network: at every layer you define K filter kernels, and the convolution is performed by moving each filter kernel across the image, at every point multiplying with the underlying pixel values and adding to get the output. Max pooling is done to reduce the size of your feature maps. We repeat these layers, eventually leading to a fully connected layer and the decision layer; we have not seen those yet. I have also not mentioned that there is a non-linearity, because of course this is a neural network: following a convolution you usually apply a pointwise non-linearity.

What do I mean by that? Let us say, for instance, that in this case I have done a convolution with a stride. I take these values, which is what you would typically see in your ANN as W^T X, where W here is the elements of your filter kernel and X comes from the pixel values, and pass them through a non-linearity, like a ReLU for instance. Every activation in the output feature map is put through the non-linearity, so that layer is always there.

So convolution followed by non-linearity is what is typically done; that is the usual thing. In deep learning parlance you have a linear layer, which is basically W^T X as we saw, just a sum of products, followed by a non-linearity, then convolution, non-linearity, max pooling, and so on: this is the typical sequence of operations that are done in a convolutional neural network.
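As an illustration, here is a hedged PyTorch sketch of one such convolution / non-linearity / pooling block; the channel counts and input size are illustrative only.

```python
import torch
import torch.nn as nn

# One typical CNN block: convolution (the W^T x sum of products),
# a pointwise ReLU, then 2 x 2 max pooling with stride 2.
block = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1),
    nn.ReLU(),                          # pointwise non-linearity on every activation
    nn.MaxPool2d(kernel_size=2, stride=2),
)

x = torch.randn(1, 3, 32, 32)           # batch x channels x height x width
print(block(x).shape)                   # torch.Size([1, 16, 16, 16])
```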

In the next few videos we will look at the typical construction of a convolutional neural network: what the layers are, how we define these layers, how we would define the number of kernels in every layer, what it would look like, and how we would progress to, say, a fully connected layer and then to a decision layer, or whether we can skip the fully connected layers or not. That is another aspect we would like to look at, and this will be covered in the next series of lectures.

Machine Learning for Engineering and Science Applications
Professor Dr. Ganapathy Krishnamurthi
Department of Engineering Design
Indian Institute of Technology, Madras
Lecture 50
Types of Convolution

(Refer Slide Time: 00:13)

In this video we will look at the different types of convolution that are typically done in a convolutional neural network, focusing especially on dilated and transposed convolutions, which are often used in deep networks and in networks that have an encoder-decoder type of architecture.

(Refer Slide Time: 00:32)

So just to recap what we call naive convolution, the typical convolution operation done in a deep neural network. We are given the parameters:

a. Kernel size = Kw, Kh
b. Stride = Sw, Sh
c. Padding = p

The output feature map's width and height are calculated by the formula below: the size of the feature map, minus the size of the kernel, plus 2 times the padding, divided by the stride, plus one gives the width, and the same formula applies to the height.

Ow = (Iw − Kw + 2p)/Sw + 1

Oh = (Ih − Kh + 2p)/Sh + 1

This is the operation we typically see: it involves superimposing the filter kernel, shown in red here, over the green input volume and then striding it across the input volume. Just to illustrate: the first element here is calculated by superimposing the 3 × 3 kernel, and then we do the same after moving the filter kernel by one stride (in this case the stride is actually 2, though it is written as 1 here), and we repeat this calculation throughout the cross-section of the feature map.

(Refer Slide Time: 01:51)

Now, dilated convolutions are another type of convolution, used to increase the receptive field of a convolution while keeping the filter small: a cheap way, in the computational sense, of getting a larger receptive field with a smaller filter kernel. For instance, in this image we have the input feature map in green and a filter kernel of size 5 × 5, and as we saw earlier, we just move the filter kernel across the feature map in order to obtain the output feature map shown here, okay.

Now the idea is to use a 3 × 3 convolution with a dilation factor, so that it can see the same area as a 5 × 5 convolution while the number of parameters remains the same: you have 9 parameters instead of the 25 used in a 5 × 5 convolutional kernel, yet by having the dilation we get the same receptive field size.

So how do we accomplish that? What is shown in this image is that the red elements correspond to the elements of a 3 × 3 kernel, and by appropriately inserting rows and columns of zeros, we inflate the filter kernel to size 5 × 5 and then proceed to do the convolution as before. This is very advantageous in the sense that with little computational expense you are able to get the same receptive field size.

Recall, as we have seen earlier, that to get a receptive field of size 7 × 7 you would do three 3 × 3 convolutions in succession, which means an increased number of operations, while in this case we directly get a 5 × 5 receptive field just by appropriately adding zero rows and zero columns to the filter kernel.

The number of zero rows and zero columns that you insert is what is referred to as the dilation factor, okay. So just to recap: the idea is to inflate the size of your kernel by inserting rows and columns of zeros, so that you get a slightly larger receptive field.
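A sketch of that inflation in NumPy; the kernel values are arbitrary.

```python
import numpy as np

# Inflate a 3 x 3 kernel so it covers a 5 x 5 receptive field while
# keeping only 9 trainable values: insert a zero row and column between
# every pair of kernel rows and columns.
k = np.arange(1, 10, dtype=float).reshape(3, 3)
dilated = np.zeros((5, 5))
dilated[::2, ::2] = k          # kernel values land on every other position
print(dilated)
```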

(Refer Slide Time: 04:15)

The next topic is transposed convolution. The idea behind transposed convolution is to help increase the size of the output feature map; these are typically used in encoder-decoder networks, especially on the decoder side, so as to increase the size of the feature maps as you go towards the output. The idea is to regain the original spatial resolution, not to regain the actual feature map itself, just its original resolution, that is, its size; and you appropriately pad with zeros, and also insert zeros into rows and columns of your input feature map, in order to obtain the appropriately sized output, okay.

A good way of thinking about it is this. Say the input feature map is of size 2 × 2, you are also given a kernel size, say 3 × 3, and you want a certain output size, say some m × m (we do not care which). The idea behind the transposed convolution is to interpret the input you are given, the input to the transposed convolution, as if it were the output of a direct convolution.

So we have to find out what feature-map size, when operated upon with a kernel of size 3 × 3, given the padding and stride, would give this output size. In the transposed convolution we try to regain that particular input resolution, which is basically trying to reverse that operation. Here, once again, to reiterate: we do not seek to reconstruct the input feature map itself, just to regain its resolution.

In the example given here, you want to reconstruct this resolution: 5 × 5 is your target output, your input is actually a 2 × 2 feature map given by the green squares, your filter kernel is of size 3 × 3 given in grey, and you seek to obtain a 5 × 5 output; the stride parameters are given. So if you think about it, the question is: what input feature size, when convolved with a filter kernel of size 3 × 3, would give rise to a 2 × 2 output? That is what we have to figure out.

It turns out that it is 5 × 5, and the idea is that in order to regain this 5 × 5 size, we have to have zero padding of 2 on the outside, and we also need zero columns and zero rows inserted into the input feature map of size 2 × 2, and then proceed with the convolutions as before. For instance, to understand this better: say your input size is 5 × 5 and your kernel size is 3 × 3. If you do a naive convolution, the output size would be 5 − 3 + 1 = 3, which is 3 × 3, and that is not what we want.

So if we want to get a 2 × 2 output, what do we have to do? Say we add a stride of 2, in which case the output size would be 2 × 2. So if we have a stride of 2, no zero padding, an input feature map of size 5 × 5 and a filter kernel of size 3 × 3, then with a stride of 2 your output feature map would be of size 2 × 2, okay.

Now we want to reverse this operation. Since we had a stride of 2, we insert s − 1 = 1 row of zeros and 1 column of zeros into the feature map of size 2 × 2, and since there was no zero padding in the forward direction, everything is reversed: because there was no zero padding there, we have to do a zero padding when we do the transposed convolution, if you think of it that way.

So we have a padding of 2, in order to get the appropriate size, and the stride is now 1 as before. If you then do the convolution like a naive convolution, your usual convolution, on this particular input, you end up with a 5 × 5 feature map, okay. This transposed convolution is the one that is often used in encoder-decoder networks, or in any situation where you have to upsample your feature maps.
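A sketch of exactly this bookkeeping (zero insertion, zero padding, then a naive convolution), with made-up values:

```python
import numpy as np

# Transposed convolution of a 2 x 2 input (forward stride was 2, kernel 3 x 3):
x = np.array([[1., 2.],
              [3., 4.]])

up = np.zeros((3, 3))
up[::2, ::2] = x               # insert s - 1 = 1 zero row and column
up = np.pad(up, pad_width=2)   # zero padding of 2 around the border -> 7 x 7

k = np.ones((3, 3))            # toy kernel
out = np.zeros((5, 5))         # naive convolution: (7 - 3 + 1) = 5 per side
for i in range(5):
    for j in range(5):
        out[i, j] = np.sum(up[i:i+3, j:j+3] * k)

print(out.shape)               # (5, 5): the original resolution is regained
```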

(Refer Slide Time: 11:03)

Another way of looking at it: let us say you have a network, typically a CNN, with a bunch of feature maps. This is your forward pass through the network, and you get your output. We know that getting from the input to the output is basically a sequence of matrix multiplications: if the weights in each layer are W1, W2, and so on (to be slack with the notation here), the output is a sequence of matrix multiplications, W1 X and then so on and so forth; if there are L layers, WL acts last. So this is your output during the forward pass.

Now during the backward pass, or backprop, the gradients are propagated by transposed matrix multiplications: you back-propagate the error in this reverse direction, and as you will notice, you are back-propagating the error from a smaller feature map to a larger feature map, which means you are actually doing an operation that is the same as the transposed convolution. That is one way of understanding transposed convolutions.
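A minimal sketch of that matrix view with made-up shapes: the forward multiply shrinks the vector, and multiplying by the transpose maps gradients back to the larger size.

```python
import numpy as np

# Forward: y = W x maps a length-9 input down to a length-4 output.
W = np.random.randn(4, 9)
x = np.random.randn(9)
y = W @ x

# Backward: gradients flow through W^T, from length 4 back up to length 9,
# the same shape arithmetic as a transposed convolution.
grad_y = np.random.randn(4)
grad_x = W.T @ grad_y
print(y.shape, grad_x.shape)   # (4,) (9,)
```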

Machine Learning for Engineering and Science Applications
Professor Dr. Ganapathy Krishnamurthi
Department of Engineering Design
Indian Institute of Technology, Madras
CNN Architecture LeNet and AlexNet
Part 1

(Refer Slide Time: 00:14)

Hello and welcome back. In this video we will look at LeNet.

(Refer Slide Time: 00:18)

LeNet-5, as it was dubbed, was first reported back in 1998. It was one of the earliest instances of convolutional neural networks used for image recognition; its specific application was digit recognition, and it apparently had commercial application, where it was used to read millions of cheques in banks. This network serves as a kind of template for most of the modern networks that we see today.

We will just briefly look at it and see some of the salient features of this network. It took as input 32 × 32 images; the network was trained with images from the MNIST database, which are of size 28 × 28, and these images were further modified (in fact rather like data augmentation, in that several distortions were introduced) for training this network.

So it took as input images of size 32 × 32, basically images of handwritten digits which have been scanned and discretized. The typical architecture is: input, followed by a convolution, followed by a pooling layer, and this is repeated, leading finally to a couple of fully connected layers and an output which is one of ten classes; you have to classify the digits 0 to 9.

The network had about 60,000 parameters, and it has several interesting concepts. For instance, the first layer has a 5 × 5 convolution with no zero padding, which gives rise to a 28 × 28 output, followed by 2 × 2 pooling; this is an average pooling operation with a stride of 2, where the output is basically the average of the four elements in that 2 × 2 area of the filter. Then follows another 5 × 5 convolution, again with no padding, which gives rise to a 10 × 10 output, and subsequently another pooling, which gives rise to 16 maps of size 5 × 5.

Now when we do a 5 × 5 convolution on top of these maps, it is basically the same as a fully connected layer, but the paper also calls this a convolution layer. So we start with the 32 × 32 input here, perform 5 × 5 convolutions followed by pooling, then one more 5 × 5 convolution followed by pooling, which gives rise to 16 feature maps of size 5 × 5, and then we again do a 5 × 5 convolution on those, giving rise to 120 outputs of size 1 × 1. That is the interesting part: LeCun, the first author of this paper, mentions this idea, which right now we would call fully convolutional layers.

If we have an input bigger than 32 × 32, then the size of the feature maps in this layer would be bigger than 1 × 1. In fact, for, say, a 64 × 64 input, one way of looking at it is that we can stride the entire network across the 64 × 64 with a stride of 32, and we would actually get a 2 × 2 output here. Finally there is a fully connected layer and then a decision layer, or output layer; we can use softmax here if we want.

This network has several features that are repeated even now in current networks. One is that as you go deeper into the network, starting from the input, the size of the feature maps typically shrinks, because of the convolutions and also because of the pooling operation, which is really a subsampling operation; and not only that, as the size of the feature maps decreases, the number of feature maps increases. In the end we have 120 feature maps of size 1 × 1, while in the first layer we have about 6 feature maps of size 28 × 28.

You will see this general principle reflected in current architectures as well. This network achieved state-of-the-art results in 1998 for digit recognition based on the MNIST digit database.
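For illustration, here is a hedged PyTorch sketch of the layer sequence just described. It is meant only to show the shape bookkeeping; the original 1998 network's activations and subsampling details are simplified.

```python
import torch
import torch.nn as nn

# LeNet-5-style layout: conv -> pool -> conv -> pool -> conv -> FC -> 10 classes.
lenet = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5),     # 1 x 32x32 -> 6 x 28x28
    nn.AvgPool2d(2, stride=2),          # -> 6 x 14x14
    nn.Conv2d(6, 16, kernel_size=5),    # -> 16 x 10x10
    nn.AvgPool2d(2, stride=2),          # -> 16 x 5x5
    nn.Conv2d(16, 120, kernel_size=5),  # -> 120 x 1x1 ("fully convolutional")
    nn.Flatten(),
    nn.Linear(120, 84),
    nn.Linear(84, 10),                  # one of ten digit classes
)

x = torch.randn(1, 1, 32, 32)
print(lenet(x).shape)                   # torch.Size([1, 10])
```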

(Refer Slide Time: 05:02)

Next we will look at AlexNet; this work was done in 2012. AlexNet was an entry in the ImageNet large-scale image recognition challenge, among the first CNNs to be entered into the challenge, and it actually beat its nearest competitor by more than 10 percentage points. It was a deep neural network with about 7 layers and 60 million parameters. The architecture of this network is very similar to LeNet-5 in terms of the convolution and max pooling operations, but it had various other innovations, as we will see in this video.

Just to give a brief overview of the network: for the ImageNet challenge the inputs are RGB images, so the input layer is of size 224 × 224 × 3 (some sources have pointed out that it should actually be 227 × 227 in order for the output of the second layer to be consistent). The first convolution layer has 11 × 11 filters with a stride of 4, giving rise to 96 feature maps of size 55 × 55, and then we have a max pooling operation using a 3 × 3 kernel and a stride of 2, which effectively halves the size of the feature maps, followed again by 5 × 5 convolutions.

This is the typical structure: all the max pooling operations are 3 × 3 with a stride of 2, and the following convolution operations are 5 × 5 with a stride of 1. The intermediate convolution layers you see here, without any max pooling, use convolutions with a padding of 1 to preserve the size of the feature maps; I said 5 × 5 for almost all of the convolution operations, but in these three intermediate layers the filter kernel size was 3 × 3.

So in these boxes, the intermediate convolution layers shown here have convolution filter kernels of size 3 × 3 with a pad of 1 to preserve the size of the feature maps, followed by max pooling and then 2 fully connected layers and an output which is one of a thousand: the ImageNet classification challenge provides a thousand image categories, and you have to classify each image as one of the thousand.

So it is a pretty deep network at about 7 layers: if you count, there are about 5 convolution layers and about 3 max pooling layers, and if you just count the convolution plus the fully connected layers, you have about seven weight layers. This network also had a normalization layer, called local contrast normalization: if you look at a particular pixel, or activation, in a feature map, you normalize its value by looking at the adjacent feature maps at the corresponding location.

Of course, this is no longer done; it was a one-off thing for this particular realization of the network. Another thing about this network: a lot of the computation happens in the earlier layers, because you have 11 × 11, 5 × 5 and 3 × 3 convolutions there, but if you look at the fully connected layers at the end, most of the parameters come from there; between these two fully connected layers alone there is a 4096 × 4096 weight matrix, okay.

And the max pooling before that takes 256 × 13 × 13 as input with a stride of 2; you unroll the pooled output and follow up with a fully connected layer of 4096 activations. So a lot of the parameters occur towards the end, while a lot of the computation is at the input side of the network, because of the size of the convolution kernels.

So you see this network actually has a mix of 11 × 11, 5 × 5 and 3 × 3 convolutional kernels.

(Refer Slide Time: 10:19)

Let us just look at this convolution layer at the input. One aspect of this network, highlighted in this figure, is that the GPUs available at the time this particular network was trained had only 3 gigabytes of memory, so the computations for this network were split across 2 GPUs: the network from the previous slide is split into two pathways.

Basically, all the computation in the top half here goes to one GPU, and all the computation in the bottom half goes to the other GPU. For the first layer we have 96 filter kernels, as we saw: 48 of them on one GPU and 48 on the other, okay. Now if you look at the number of parameters, it is very easy to illustrate how to calculate them: on just one pathway, each filter has 11 × 11 (× 3) parameters, and we have about 48 of them; and there are two pathways, GPU 1 and GPU 2, in this one convolution layer.

And it produces 96 feature maps in all, which is why we have 48 × 2, each of size 55 × 55. If you want to calculate the number of computations per GPU: the output size is 55 × 55 × 48, which is the total number of activations on one GPU in this particular convolution layer, and to produce each output we have to do 11 × 11 × 3 computations.

So the number of output activations is basically 55 × 55 × 48, and for each output we need to perform 11 × 11 × 3 multiplications. I have ignored the additions that also have to be done; it is a sum of products, so you can put in a factor for them if you want, but this is the typical way to calculate the number of computations. Of course the network is spread across two GPUs, so this is for one half of the feature maps: in all you have twice as many, okay.
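The arithmetic, written out (a sketch; the variable names are mine):

```python
# Back-of-the-envelope counts for AlexNet's first convolution layer, per GPU.
params_per_gpu = 48 * (11 * 11 * 3)           # 48 kernels of 11 x 11 x 3 weights
activations    = 55 * 55 * 48                 # output feature-map volume on one GPU
mults_per_gpu  = activations * (11 * 11 * 3)  # one sum of products per activation
print(params_per_gpu)                         # 17,424
print(mults_per_gpu)                          # 52,707,600 (~5 x 10^7)
```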

This is how you would typically calculate the number of computations in the network, and you see the number of parameters is actually quite small in the initial layers. If you go to the final layers, as I showed before in the earlier slides, between the two fully connected layers you can calculate 4096 × 4096 parameters; that is the weight matrix for a fully connected layer, calculated just as you would for a regular artificial neural network. Of course, the number of parameters coming from the pooling layer to the first fully connected layer is again much higher, because after the pooling you have to unroll it and then make it into a fully connected layer.

So those are typical computations: the number of parameters there is huge, as I calculated, but a lot of the computation happens in the earlier stages. There is another innovation here. I referred to two pathways because, if you look at how the training is spread across two GPUs, you see that except for the layers where I have the cross arrows, in all the other layers the computations are restricted to the feature maps on that GPU.

So, say you look at this particular layer: it has 48 feature maps, and we do a 3 × 3 pooling, but the pooling operation draws only from these feature maps, and when you do the subsequent convolution with the 5 × 5 kernel, that kernel acts only across this region. That is why we call it multiple pathways, or two pathways. And if you come to this layer, the output activations here include feature maps from both GPUs.

So in this one layer the computations involve both GPUs, drawing from feature maps on both GPUs, while in the other layers they are done independently; this is like having 2 different pathways in the network. This network was also one of the first to use the ReLU non-linearity, which we saw in an earlier slide; it also used dropout regularization, which we saw earlier; and it used data augmentation, where the training images were augmented, because, as you see, the number of parameters is huge: there are a few million training images but 60 million parameters, and with so many weights you do need more data.

It also used L2 regularization to prevent overfitting; dropout and L2 were the regularizers. The data augmentation was done by flipping, shifts and translations, as well as jittering the RGB values, all done dynamically, on the fly, as the network was training. The network placed first in the ImageNet recognition challenge: in the top-5 category it had an error of about 15 percent, which was much lower than the second-place finisher; again, one of the earliest CNNs and the first to win the ImageNet competition.

This network sparked a huge interest in deep learning. It had lots of parameters and was quite deep: seven weight layers, if you can call it that, 1, 2, 3, 4, 5, 6, 7, interspersed with 3 max pooling layers, and it had a mix of filter kernels, 11 × 11, 5 × 5 and 3 × 3, in a kind of systematic way. Notice also how the number of feature maps increases as we go deeper into the network.

In the initial layers, the first layer, we had 96, then 256 here (I am adding from both GPUs), then 384, 384, and the count was preserved there. So as you go deeper into the network, the number of feature maps increased: the size of the feature maps shrank, but the number of representations increased. This is a typical architectural design that you see in most networks now.

As you go deeper into the network there will definitely be a decrease in feature-map size, unless you do appropriately zero-padded convolutions, but you can always increase the number of feature maps to improve your representation; that is the general principle. It was also seen in LeNet-5, but that network was much smaller, both because of the size of the input images and because of the problem concerned (it was just recognizing small images), and it was also constrained by the computational resources available at that time.

So this was the AlexNet architecture. In summary: AlexNet was the first CNN to be used for the ImageNet classification challenge and produced state-of-the-art results at the time, with a top-5 error of about 15 percent. It had a huge number of parameters, about 60 million of them, most of them towards the decision layer; basically, the fully connected layers contributed most of the parameters. The design principle was to use filters of multiple sizes, 11 × 11, 5 × 5 and 3 × 3, and the number of feature maps increased as you go deeper into the network.

It had two pathways; the authors also report that having these multiple pathways actually improved their results. Two pathways in the sense that the computations were split across two GPUs, but, as we saw, the computations were not shared, except in one or two layers, especially this layer here, and towards the end in the fully connected layers; that is where the computations were shared.

Other than that, you can call these two separate pathways. The results reported were based on an ensemble of networks, multiple networks, more than 5, about seven of them, which also led to an improvement of about two or three percentage points. Layer normalization was also introduced here, but this particular normalization layer is no longer used; they did have a normalization layer in between the convolutional layers, just before the input to the convolution layer.

In the subsequent videos we will look at other architectures that gave state-of-the-art results on the ImageNet database.

Machine Learning for Engineering and Science Applications
Professor Dr. Ganapathy Krishnamurthi
Department of Engineering Design
Indian Institute of Technology, Madras
CNN Architecture VGG Net
Part 2

(Refer Slide Time: 00:14)

We will look at the VGG network, specifically VGG-16. VGG stands for the Visual Geometry Group at Oxford University. This particular network was entered into the 2014 ImageNet challenge. It had 16 weight layers; the one we are looking at is the 16-weight-layer version, and they also had another version with 19 weight layers. The design of this is very similar to LeNet and AlexNet, except that they have made it a little bit more systematic, so we will see what that systematic structure is.

In one of the earlier videos we saw that it is best that, as you go deeper into the network, the depth, that is, the number of feature maps, increases, so the network becomes wider. That was incorporated here. They also stuck to one filter size, 3 × 3, which is the smallest meaningful filter size (you can also do a 1 × 1 convolution, but of course we need a receptive field, so 3 × 3 is the smallest); they stuck to 3 × 3 filters throughout all the layers.

There are a lot of parameters, around 130 to 140 million depending on which version of the network you are looking at. Once again we will just walk through the network briefly and then see what advantages this design gives. The input is the same 3-channel RGB image of size 224 × 224. The first stage has two 3 × 3 convolutions in succession, each with 64 feature maps, followed by a max pooling layer which reduces the size of the feature maps to 112 × 112. This is followed by another set of convolutions and max pooling, and so on, until we get to the later stages, which again use 3 × 3 convolutions but this time three of them in succession.

These are followed by a max pooling again, and at the end the feature maps are rasterized and fed into fully connected layers, these two here, and the output is a 1-of-1000 classification. If you look at this, you will see that it is organized as blocks; this was one of the earliest networks to introduce this kind of concept, and most networks now use it.

A block of convolutions means that a single layer is now replaced by a bunch of convolutional layers: a sequence of layers is used as one unit. In between these convolution layers there is no max pooling, but the non-linearity is still there: 3 × 3 convolution, non-linearity, then 3 × 3 convolution, non-linearity, and so on. In this case two or three successive non-linearities are applied following the convolutions. A minimal sketch of one such block is shown below.
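Here, as a minimal sketch assuming MATLAB's Deep Learning Toolbox (with an illustrative width of 64 feature maps), one VGG-style block of two 3 × 3 convolutions, each followed by a ReLU, with a max pooling at the end, would look like this:

% One VGG-style block: conv-relu-conv-relu-pool ('same' padding keeps the
% spatial size, so only the pooling halves it)
block = [
    convolution2dLayer(3, 64, 'Padding', 'same')   % 3x3 conv, 64 feature maps
    reluLayer                                      % non-linearity
    convolution2dLayer(3, 64, 'Padding', 'same')   % second 3x3 conv
    reluLayer                                      % second non-linearity
    maxPooling2dLayer(2, 'Stride', 2)];            % halves the spatial size

Stacking such blocks while doubling the feature-map count after each pooling stage reproduces the overall VGG pattern.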

So what does this provide in terms of an advantage? If you look at a succession of 3 × 3 convolutions, one benefit is that the extra non-linearity leads to greater discrimination. We have seen with other classifiers that if we have sufficiently non-linear features, we expect that at some point the classes become linearly separable. The way this is incorporated into the network is to have a succession of non-linearities, hopefully giving rise to better discriminatory power.

The second advantage is in terms of the receptive field. The authors discuss this as well: consider, for instance, these two stacked 3 × 3 convolutions, one convolution layer with 64 feature maps and 3 × 3 filters, followed by another layer of 64 feature maps with 3 × 3 filters.

Now look at the receptive field of this block. The first layer has a receptive field of 3 × 3 on the input. For the second layer, the filter kernel size is again 3 × 3, but we are asking about its receptive field on the original input: for the first layer it is 3 × 3, and for the second it becomes 5 × 5. This is not too hard to see. Each activation in the second layer's feature maps is itself the result of a 3 × 3 convolution over the first layer, and each of those first-layer activations looks at a 3 × 3 patch of the input; so moving one step in any direction adds one more row and column all around the original 3 × 3 patch.

So its receptive field on the input is actually 5 × 5. Similarly, for a succession of three 3 × 3 convolution layers, the receptive field of the third layer corresponds to 7 × 7 on the input; you can verify that on your own. What the authors claim as the computational gain is this: with a succession of three 3 × 3 convolutions, the number of computations involved to produce one element in the output feature map is much smaller than with a single 7 × 7 convolution, while you still get the same large receptive field on the input. A larger receptive field is often desirable so that more of the context in the image enters your activation maps, but getting it with a single much larger convolution means the number of computations increases. So one advantage of using a succession of 3 × 3 convolution layers, instead of a single filter with a larger receptive field, is the savings in the number of parameters.

For instance, if you had used 7 × 7 filters on C input channels, producing C output channels, the number of parameters would be 49C². If instead you use a succession of three 3 × 3 convolution layers to get the same receptive field, the number of parameters is about 27C². This decrease might not be obvious, but the computation is simple: with C input channels, each 3 × 3 filter has 3 × 3 × C elements, and with C output channels there is one more factor of C, giving 9C² parameters per layer.

With a succession of three such layers, that is 3 × 9C² = 27C² parameters; a similar calculation for a single 7 × 7 layer gives 49C². So VGG Net uses a succession of convolution layers with very small filter sizes. The total number of parameters is of course much higher than AlexNet's, but the results were much better, about 7.3 percent top-5 error rate. It was not the winning entry in the classification task, but it won in other categories of the challenge, such as localization, and its error rate was still very good, around 7 percent.
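As a quick sanity check on the parameter counts above, here is the arithmetic for an illustrative choice of C = 256 channels:

% Parameters for one 7x7 layer versus three stacked 3x3 layers (C in, C out)
C = 256;                           % illustrative channel count
params_7x7  = 7*7*C * C;           % 49*C^2 = 3,211,264
params_3of3 = 3 * (3*3*C * C);     % 27*C^2 = 1,769,472
fprintf('7x7: %d, three 3x3: %d\n', params_7x7, params_3of3);

Both give the same 7 × 7 receptive field on the input, but the stacked version needs only 27/49, roughly 55 percent, of the parameters.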

So the winning network was actually the inception network, GoogLeNet from Google; we will see much more about it in the next video.

Machine Learning for Engineering and Science Applications
Professor Dr. Ganapathy Krishnamurthi
Department of Engineering Design
Indian Institute of Technology Madras
CNN Architecture (GoogLeNet)
Part 3

(Refer Slide Time: 00:12)

In this video we will look at GoogLeNet. The basic building block of GoogLeNet is the inception module; this was the 2014 winner of the ImageNet challenge. So let us look at the basic building block, the inception module. For the VGG network we saw in the previous video, the basic building block was a block of convolutional layers with very small filter sizes, and if you recall AlexNet, it had variable filter sizes: the first layer used 11 × 11, then we had 5 × 5 filters, then 3 × 3, and then fully connected layers. The inception module incorporates both of these ideas, in the sense that every layer has all possible filter sizes: they build a convolution block which has multiple filter sizes, and they let backprop decide which weights to update based on the objective function, so the network itself learns which filters matter.

So let us quickly look at this inception module. There are two images here: this one is what the authors of the GoogLeNet paper call the naive implementation, and this is the actual implementation that they followed. What is the difference? We have an input feature-map stack, whose thickness (depth) is indicated here by a different colour, say K feature maps, and we apply 1 × 1, 3 × 3 and 5 × 5 convolutions, and also a 3 × 3 max pooling, to this stack; the outputs of each of them are taken and concatenated across the depth.

This means we have to do the convolutions so as to get the same spatial size of feature map from each branch, so they are all zero-padded accordingly so that the outputs can all be concatenated. As you will see, doing it this way is a problem: implementing the block, the inception module as they call it, in this naive way has issues in terms of the number of computations involved, which becomes huge because the block also contains large filter sizes such as 5 × 5. It would presumably be possible to use 7 × 7 as well, but the authors of the inception module stopped with 5 × 5, which probably worked well enough for the ImageNet challenge.

So what is the better implementation? The better implementation is to use 1 × 1 convolutions to reduce the size of the feature-map stack; when I say size in this context, I mean the depth of the stack. Suppose the input has 28 × 28 × 192 feature maps, as noted down here; that is where we are going to apply the 1 × 1 × 192 convolutions. We use the 1 × 1 convolutions to project this number of feature maps to a smaller number, a smaller volume. If you recall, 1 × 1 convolutions preserve the spatial size of the feature maps; the XY size is preserved, but we can use them to reduce the depth of the stack, and that is precisely what the inception module accomplishes. So if you have a very large number of feature maps in the input volume, you use this 1 × 1 convolution to bring it down. For instance, in this case,

let us say we take the 1 × 1 convolution before the 5 × 5: we can use the 1 × 1 convolution to reduce the stack to, say, 28 × 28 × 16, and then the 5 × 5 convolution can output, say, 32 feature maps. Just to recall how this is done: we define 16 separate 1 × 1 filters, and of course each 1 × 1 convolution acts across the whole depth of the input volume. You can question whether this leads to a loss of information; the number of feature maps here seems to be a parameter they tweaked, and it has to be settled by some kind of cross-validation to see what is most optimal. So these 1 × 1 convolutions are applied prior to the convolutions with larger filter sizes: the 1 × 1 projects the input volume to a smaller depth, and then the 3 × 3 and 5 × 5 convolutions operate on the reduced volume.

This is what is now referred to as a bottleneck layer: you shrink the depth of the feature-map stack. It works similarly for the max pooling branch, because max pooling preserves the depth of the feature maps, so you can again use a 1 × 1 convolution after it to reduce the depth. These were the two big innovations in the inception module. In VGG we saw that the network was divided into blocks, each block being a succession of convolutions; in the inception module, each block incorporates multiple convolution kernel sizes, just as AlexNet used 11 × 11, 5 × 5 and 3 × 3, but here it is 1 × 1, 3 × 3, 5 × 5 and a max pooling layer, all in one module. And you also use these bottleneck 1 × 1 convolutions to project the input volume to a lower depth before doing the 3 × 3 and the other convolutions with larger kernel sizes. A sketch of such a module follows.
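As a minimal sketch in MATLAB's Deep Learning Toolbox (the layer names are made up, ReLUs are omitted for brevity, and the feature-map counts loosely follow the inception 3a numbers discussed below, so this illustrates the idea rather than the authors' exact implementation):

% Inception-style module on a 28 x 28 x 192 input volume
lg = layerGraph(imageInputLayer([28 28 192], 'Name','in', 'Normalization','none'));
% Branch 1: plain 1x1 convolution, 64 maps
lg = addLayers(lg, convolution2dLayer(1, 64, 'Padding','same', 'Name','c1'));
% Branch 2: 1x1 "reduce" to 96 maps, then 3x3 giving 128 maps
lg = addLayers(lg, [convolution2dLayer(1, 96, 'Padding','same', 'Name','c3r')
                    convolution2dLayer(3, 128, 'Padding','same', 'Name','c3')]);
% Branch 3: 1x1 reduce to 16 maps, then 5x5 giving 32 maps
lg = addLayers(lg, [convolution2dLayer(1, 16, 'Padding','same', 'Name','c5r')
                    convolution2dLayer(5, 32, 'Padding','same', 'Name','c5')]);
% Branch 4: 3x3 max pooling (stride 1, padded), then 1x1 projection to 32 maps
lg = addLayers(lg, [maxPooling2dLayer(3, 'Stride',1, 'Padding','same', 'Name','p')
                    convolution2dLayer(1, 32, 'Padding','same', 'Name','pp')]);
% Concatenate the four branches along depth: 64 + 128 + 32 + 32 = 256 maps
lg = addLayers(lg, depthConcatenationLayer(4, 'Name','cat'));
lg = connectLayers(lg, 'in', 'c1');      lg = connectLayers(lg, 'in', 'c3r');
lg = connectLayers(lg, 'in', 'c5r');     lg = connectLayers(lg, 'in', 'p');
lg = connectLayers(lg, 'c1', 'cat/in1'); lg = connectLayers(lg, 'c3', 'cat/in2');
lg = connectLayers(lg, 'c5', 'cat/in3'); lg = connectLayers(lg, 'pp', 'cat/in4');

All four branches are zero-padded to preserve the 28 × 28 spatial size, so their outputs can be concatenated along the depth.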

(Refer Slide Time: 07:23)

Looking at the network as a whole, it had about 22 weight layers. There was an initial sequence of convolutions and max poolings which reduced the input, the usual 224 × 224 × 3, down to 28 × 28 × 192. That volume was then used as input to a sequence of inception modules, followed by max pooling, then another sequence of inception modules followed by max pooling, and so on, until we reach the typical output with one of a thousand activations. They have labelled the inception modules 3a, 3b, 4a, 4b and so on; if you read the paper, and I urge you to, you will see that for each of these inception modules there is a corresponding table which tells you how the computations are done in that particular module.

We will walk through one inception module, 3a, and see the savings in the number of computations obtained by using the 1 × 1 bottleneck to reduce the number of feature maps. This is a 22-layer network, but it has very few parameters, about five million, and it won the 2014 ImageNet challenge with a top-5 error rate of about 6.7 percent, slightly better than VGG but with a much smaller number of parameters. So this is one of the networks which is not especially deep but is very lean in terms of parameters, in contrast to, say, AlexNet or VGG: VGG-16 has 138 million parameters and AlexNet has 60 million, but this one has only 5 million weights. We will look at inception 3a right here, which takes as input 28 × 28 × 192, and whose output is 28 × 28 × 256.

(Refer Slide Time: 10:04)

We have reproduced a piece of the table here; I urge you to go back and look at the full table in the paper. This is the output size of the max pooling layer which feeds as input to inception module 3a, and this is the output size of module 3a. It has 64 plain 1 × 1 convolutions. The '#3 × 3 reduce' entry, 96, means that 96 feature maps are produced by the 1 × 1 convolutions applied prior to the 3 × 3 stage, and the 3 × 3 convolutions then produce 128 feature maps. Similarly, 16 feature maps are produced by the 1 × 1 convolutions prior to the 5 × 5 stage (the '#5 × 5 reduce'), and another 32 come from the 1 × 1 projection following the max pooling branch. So, in the diagram: prior to the 3 × 3 we have 96 feature maps, prior to the 5 × 5 we have 16 feature maps, and the output of the pooling branch gives 32 feature maps.

Looking at the outputs: the 3 × 3 convolution produces 128 feature maps, the 5 × 5 produces 32, the pool projection produces another 32 after the max pooling layer, and the plain 1 × 1 convolution itself produces 64. The reduce values are 96 before the 3 × 3 and 16 before the 5 × 5; for the pool projection branch the 1 × 1 comes after the pooling, and the plain 1 × 1 and max pooling branches take the input volume directly.

So the '#3 × 3 reduce' is the number of feature maps produced by the 1 × 1 convolutions. Recall once again that these 1 × 1 convolutions are done prior to the 3 × 3 and 5 × 5 stages, while the 1 × 1 convolution following the max pooling has nothing before it, and there is also one plain 1 × 1 convolution branch. That is what we have here: 64 feature maps from the plain 1 × 1 branch; 96 from the 1 × 1 prior to the 3 × 3, whose output is then 128; and 32 output from the 5 × 5, for which the depth is first reduced to 16.

(Refer Slide Time: 13:34)

So what are the savings in terms of computations? The output of the max pool, which is the input to the inception module, is 28 × 28 × 192. If you directly do 3 × 3 convolutions over all 192 channels, the number of parameters in every filter is 3 × 3 × 192, and to produce 128 feature maps of size 28 × 28 the number of operations would be about 173 million. If instead we first do the 1 × 1 projection into a smaller volume, producing 96 feature maps from the 1 × 1 convolution, we then follow it with the 3 × 3 convolution to again produce 128 maps.

In that case we have to include the calculation for the 1 × 1 convolution as well; the easiest way is to write it down. To produce 128 feature maps of size 28 × 28 with 3 × 3 convolutions directly, the number of activations in the output is 28 × 28 × 128, and for each output activation we have to do 3 × 3 × 192 multiply-accumulates; that product is what you see there, and it comes to about 173 million. Now repeat the calculation for the bottleneck version: the output of the 1 × 1 convolution is 28 × 28 × 96 activations, each requiring 1 × 1 × 192 multiplications, about 14 million in total; then the 3 × 3 convolution produces 28 × 28 × 128 activations, each requiring 3 × 3 × 96 multiplications, about 87 million. Adding these, you see the reduction in the number of computations, roughly 100 million versus 173 million, just for this one set of feature maps.

So this was the innovation behind inception. It does two things: it lets the network decide which feature maps are more relevant, through back-propagation and the optimization; and, as we saw with VGG, using filters with a larger receptive field means the number of parameters and computations increases correspondingly, which is solved here by the bottleneck layer, a 1 × 1 convolution used to project your feature maps to a smaller depth, so the volume becomes smaller. The arithmetic is checked in the short sketch below.
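As a check on the computation counts quoted above (counting multiplications only, on the assumed 28 × 28 × 192 input):

% Multiplications needed to produce 128 output maps of size 28 x 28
H = 28; W = 28; Cin = 192; Cred = 96; Cout = 128;
direct     = H*W*Cout * (3*3*Cin);       % single 3x3 conv:   ~173 million
bottleneck = H*W*Cred * (1*1*Cin) ...    % 1x1 reduction:      ~14 million
           + H*W*Cout * (3*3*Cred);      % 3x3 on reduced:     ~87 million
fprintf('direct: %.0fM, with bottleneck: %.0fM\n', direct/1e6, bottleneck/1e6);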

(Refer Slide Time: 16:55)

We will look at the inception 3a layer. The table is reproduced from the paper; it tells you the number of 1 × 1 and 3 × 3 convolutions and so on in every module. Going branch by branch: the input to the inception 3a module has size 28 × 28 × 192, and the output has size 28 × 28 × 256. The number of plain 1 × 1 convolutions is 64, so there are 64 feature maps from that branch. The '#3 × 3 reduce' refers to the number of feature maps produced by the 1 × 1 convolution preceding the 3 × 3, which means that this 1 × 1 convolution layer produces 96 feature maps of the same spatial size, and the 3 × 3 convolutions then output 128. So the '#3 × 3 reduce' here refers to the 1 × 1 convolution layer preceding the 3 × 3 convolution.

It refers to this one right here: the number of feature maps produced as output there is 96, and these serve as input to the 3 × 3 convolution, which produces an output of 128 that is concatenated there. Similarly, the '#5 × 5 reduce' refers to the number of feature maps from the 1 × 1 convolution, 16, and the 5 × 5 itself produces 32 feature maps; the pool projection branch produces 32 via the 1 × 1 convolution following the max pooling, which keeps the spatial size fixed. The total is found by adding these up: 64 + 128 + 32 + 32 = 256 output feature maps. This is just for one inception block; you can work through the table in the paper and check that the calculations are consistent with the structure I showed you earlier.

Machine Learning for Engineering and Science Applications
Professor Dr. Ganapathy Krishnamurthi
Department of Engineering Design
Indian Institute of Technology Madras
CNN Architecture (ResNet)
Part 4

(Refer Slide Time: 00:14)

In this video we will look at ResNet. In the previous videos we looked at LeNet, AlexNet and VGG: AlexNet had seven weight layers, VGG had sixteen weight layers (plus three or four max pooling layers), and the inception network had 22 layers. The depth seems to increase as you follow the progression, AlexNet at seven, VGG at 16 or 19, inception at 22, with other versions that are much deeper as well. Now we come to ResNet. Where ResNet differs from the networks we have seen so far is that the number of layers increases dramatically: these networks had 34, 50, up to 152 layers, and there are reports of thousand-layer ResNets being trained.

So what is the principle? ResNet was the winning entry in the 2015 ImageNet recognition challenge, giving a top-5 error rate of less than 4 percent, about 3.6 percent, which is better than the human top-5 error rate of approximately 5.1 percent. ResNets were the networks which really pushed depth, going as far as 152 layers and, as I said, around a thousand for some applications. What is the motivation, the observation the authors made before creating this model? The paper reports the following: they trained deep networks on the CIFAR-10 data, a 20-layer network and a 56-layer network, and plotted the training and test error for both.

So what are shown here are two plots, the training and test error for networks trained on the CIFAR-10 database. The yellow curve corresponds to the 20-layer network and the red one to the 56-layer network, for training; similarly for testing, the lower test curve corresponds to the 20-layer network and the other to the 56-layer network. The problem with this picture is that conventional wisdom says that as you go deeper, your training and test error should improve, because your representations are supposed to get better, you should get better separability among classes, and the non-linearity increases with depth. In practice, however, it does not seem to work that way: as you go deeper, there are issues with both training and testing.

What is the reason behind this? The reason is a problem with gradient flow to the weights as you go deeper and deeper: the vanishing (or exploding) gradient problem is there and is not alleviated. It is basically an optimization problem: as you go deeper, optimizing a larger, deeper network becomes harder. To get over this problem, the ResNet paper introduces something called skip connections. These skip connections are basically identity mappings, so the network need only learn a small correction on top of them; we will see what that means in the next slide. They provide an alternate path for the gradient to flow and make training possible, and for the ImageNet challenge they managed to train up to a 152-layer network. Of course, the number of parameters also increases with this kind of approach, since going deeper means more layers.

(Refer Slide Time: 04:55)

So what is the principle behind residual networks? By construction, you could do the following: we know that the shallow network seems to work well, so use the weights of the shallow network to construct a much deeper network, and wherever there are gaps just use identity mappings. This is only a thought experiment; the real networks are trained from scratch by backprop. But the general idea is that a deeper network's training and test error should not be higher than that of the shallower network; that is the intuition. The way they approach this is to have skip layers, illustrated here: say these are two convolution layers, taking as input a bunch of feature maps X, with the output denoted H(X). In addition, what is also done in a ResNet is that the input is copied to the output: the skip connection here skips the two convolution layers, and the input to convolution layer one is added to the output of convolution layer two.

So your output is H(X) = F(X) + X. How does this help? Instead of learning H(X) directly, you learn the residual F(X). In the worst-case scenario, if the weights are not updated at all and everything stays small or zero, you will at least pass X through, so the gradient flow still happens; that is the idea, and this is why it is called learning a residual. The residual in most cases is expected to be small: the idea is that every layer only slightly perturbs its input, so whatever has to be learned is a small perturbation, and the residual will be much smaller than the full mapping. The authors also show in their paper, and in presentations available online, that the gradient update through the skip path is additive rather than multiplicative, which leads to meaningful updates to your weights, so there is no vanishing gradient problem; this is also explained by the authors. For the ResNet used in the ImageNet challenge, no detailed justification is given in the paper, but by using skip connections across two layers at a time they were able to get state-of-the-art results, with less than 4 percent top-5 error rate on the ImageNet database. A minimal sketch of one such residual block follows.
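Here is a minimal sketch in MATLAB's Deep Learning Toolbox, assuming illustrative layer names and a 56 × 56 × 64 input chosen so the identity path matches the output shape:

% One residual block: two 3x3 convolutions plus an identity skip connection
lg = layerGraph([imageInputLayer([56 56 64], 'Name','in', 'Normalization','none')
                 convolution2dLayer(3, 64, 'Padding','same', 'Name','conv1')
                 reluLayer('Name','relu1')
                 convolution2dLayer(3, 64, 'Padding','same', 'Name','conv2')
                 additionLayer(2, 'Name','add')   % element-wise F(x) + x
                 reluLayer('Name','relu2')]);
lg = connectLayers(lg, 'in', 'add/in2');          % the skip (identity) path

Because the addition is element-wise, this assumes the two paths have the same spatial size and depth; when they do not, the channels on the skip path must be matched, the heuristic discussed below.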

The networks built using these residual connections varied from around 34 layers to 152 layers. Very similarly to VGG, they used only 3 × 3 convolutions across all the layers, and subsampling was done either by a max pooling layer at the beginning or the end, or just by strided convolutions. For the very deep networks they also introduced the bottleneck concept in some cases: a 1 × 1 convolution, then a 3 × 3, followed by another 1 × 1, so that a plain 3 × 3 stage would be replaced by this kind of bottleneck module in some of the deeper networks, keeping the number of computations sustainable.

As I mentioned earlier, this showed less than four percent error rate on the ImageNet challenge, and the skip-layer connection is now used in most modern CNN implementations. Just to clarify, it is actually an addition: you take the input feature maps, say C of them, and add them to the output of this particular convolution. Now it is possible that the output here has a different number of feature maps, say C1 with C1 greater than C, in which case you would have to choose which C feature maps to add to; that is again a heuristic left to the person designing the network. You can of course ask whether you can skip more, three or four layers at a time; the people who developed ResNet seem to think that two layers at a time works best, and it seems like a heuristic at this point.

(Refer Slide Time: 10:39)

This is a summary of what we have seen so far. LeNet-5 was one of the earliest networks, with the basic sequence of convolution and pooling layers still followed today. AlexNet had an ImageNet top-5 error rate of 15.3 percent, and VGG-16 got 7.3 percent; again, all of these are ensemble results, not single networks. ResNet had a top-5 error rate of 3.6 percent, better than human raters, and the Google inception network did better than VGG, of the same order of magnitude. As for the number of parameters: AlexNet and VGG have very high parameter counts, and ResNet's count is comparable to AlexNet's even though it is much, much deeper. AlexNet had 60 million parameters for seven layers, whereas a ResNet of 50 to 152 layers, many times the depth of AlexNet, still had only about 60 million parameters: a very deep network with a parameter count comparable to a relatively shallow one. This is the progression so far in terms of results on the ImageNet challenge; we will look at one more network, DenseNets, in the subsequent video.

Machine Learning for Engineering and Science Applications
Professor Dr. Ganapathy Krishnamurthi
Department of Engineering Design
Indian Institute of Technology Madras
DenseNets – Densely Connected Convolutional Neural Networks
Part 5

(Refer Slide Time: 00:15)

In this video we will look at DenseNet, one of the more recent architectures used in the ImageNet classification challenge, which has shown exceptional classification accuracy despite having far fewer parameters. Just as we saw with ResNet, as CNNs become deep they become harder to train because the gradients begin to vanish. ResNet addressed this problem by skipping layers: taking feature maps from an earlier layer and adding them to a later layer. More generally, the key observation this paper makes is that by creating short paths from the early layers (those closer to the input) to the later layers (those closer to the output), gradient propagation is improved, and so is the classification; in fact, we can train very deep networks, more than a hundred layers, by adopting this trick.

What DenseNet does is improve gradient propagation by connecting all layers directly with each other. Suppose we have L layers: in a typical network with L layers there will be L connections, one between each pair of consecutive layers, whereas in a DenseNet there will be about L(L + 1)/2 connections. We will see what we mean by that in the next slide.

(Refer Slide Time: 01:41)

Here is a particular incarnation of a DenseNet. Say there are k₀ input maps; for an RGB image like the ones used in the ImageNet challenge, that would be three channels. The first layer creates k feature maps (k = 4 in this illustration). As you go deeper into the network, the second layer takes as input not only the previous layer's output but also the input layer itself, and the next layer takes as input the feature maps of all the preceding layers. However, the number of output maps of each layer is fixed, k per layer. The other thing to notice is that as we go deeper this becomes unsustainable: say we have ten layers, then the tenth layer takes as input all the feature maps from the preceding nine layers, and if each of these layers produces 128 or 256 feature maps, there is a feature-map explosion.

To overcome this problem, as we saw, they fix the number of output maps from each layer and also create these so-called dense blocks, outlined here in red and blue. Each dense block contains a pre-specified number of layers, and among those layers the feature maps are shared (concatenated) as discussed before. The output of a dense block is given to what is called a transition layer, which uses the bottleneck concept we saw with ResNet and inception: a 1 × 1 convolution followed by pooling to reduce the size of the feature maps. This serves two purposes. The transition layer allows for pooling, which reduces the spatial size of the feature maps; if you did not have this dense-block structure, pooling would not have been possible, because the spatial sizes of the feature maps on either side of a pooling operation differ, and it would be impossible to concatenate feature maps across layers.

(Refer Slide Time: 04:07)

The authors propose the following advantages for DenseNets. First, parameter efficiency: because there is a fixed number of output feature maps per layer, only very few kernels are learned per layer, for example about 12 kernels in one of the architectures they suggest, with 24 or 32 kernels per layer in others. They also talk about implicit deep supervision and feature reuse. What is implicit deep supervision? In inception, for instance, we saw auxiliary cost functions that use feature maps from intermediate layers; that improves learning in the sense that the intermediate features have to be discriminative so as to reduce the auxiliary cost. There have been several other approaches like that; in one approach you take feature maps from the intermediate layers, give them to an SVM as input, let it do the classification task, and back-propagate that error. Here, however, since feature maps are concatenated from the preceding layers, the feature maps and activations from the earlier layers have fairly direct access to the error (cost) function. Of course, because the layers are grouped into dense blocks, they may be separated from the error function by a couple of dense blocks, but the early activations still have much shorter paths to the error function, thereby improving training as well as the learning of discriminative features.

(Refer Slide Time: 05:41)

There are a few other terms the paper uses which are important concepts for DenseNets, summarized briefly in this slide. The growth rate determines the number of feature maps output by each individual layer inside a dense block; in the figure, for instance, each layer adds k new feature maps. Dense connectivity means that within a dense block each layer gets as input the feature maps of all the previous layers, as shown in the figure. Transition layers aggregate the feature maps from a dense block and reduce their dimensions.

Pooling is enabled in the transition layers, as is 1 × 1 convolution. Composite function: the sequence of operations inside a layer goes as follows: batch normalization, followed by an application of ReLU, and then a convolution; together these make up one convolution layer. These four concepts are basically the ones that underlie the DenseNet architecture.

(Refer Slide Time: 07:03)

Some basic details: each layer outputs k feature maps, where k is the growth rate we saw. For the convolutions, they also use the bottleneck concept we saw in ResNet and inception: a 1 × 1 convolution followed by a 3 × 3 convolution. In general, every 1 × 1 convolution outputs 4k feature maps, which are then operated on by the 3 × 3 convolutions. Before the input reaches the first dense block there is an initial conv layer whose output feature maps are used as input to the first dense block, and so on. A sketch of one such composite layer follows.
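As a minimal sketch (with illustrative names, growth rate k = 32 and a 56 × 56 × 64 input, matching the DenseNet-121 walkthrough later in this lecture):

% One composite layer: BN -> ReLU -> 1x1 (4k maps) -> BN -> ReLU -> 3x3 (k maps),
% with the output depth-concatenated with the layer input (dense connection)
k = 32;
lg = layerGraph([imageInputLayer([56 56 64], 'Name','in', 'Normalization','none')
                 batchNormalizationLayer('Name','bn1')
                 reluLayer('Name','relu1')
                 convolution2dLayer(1, 4*k, 'Padding','same', 'Name','bottleneck')
                 batchNormalizationLayer('Name','bn2')
                 reluLayer('Name','relu2')
                 convolution2dLayer(3, k, 'Padding','same', 'Name','conv3x3')
                 depthConcatenationLayer(2, 'Name','cat')]); % 64 + 32 = 96 maps out
lg = connectLayers(lg, 'in', 'cat/in2');  % concatenate the input with the new k maps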

Typically, every network they designed, for the ImageNet challenge as well as for the CIFAR database and others, has about three to five dense blocks, with a growth rate ranging from around 24 to 32; for the ImageNet challenge, the initial layer outputs 2k feature maps.

(Refer Slide Time: 08:07)

In this slide we look at a table summarizing one of the architectures, DenseNet-121, the 121-layer network used for the ImageNet challenge. Here the growth rate is 32. The initial convolution gives rise to 112 × 112 feature maps, followed by max pooling down to about 56 × 56, and then there are four dense blocks. The first dense block defines 6 pairs of (1 × 1 convolution followed by 3 × 3 convolution), that is 12 convolution layers; the second dense block defines 12 such pairs, dense block three has 24, and dense block four has 16 of these 1 × 1 followed by 3 × 3 pairs. If you add these up you get 116 convolution layers; then there are three transition layers, making 119, an initial convolution layer, 120, and the classification layer, 121. That is why it is called DenseNet-121. We will look at the details of each of these layers in the next slide.

(Refer Slide Time: 09:26)

For the DenseNet here, the image shown is from a cardiac sequence, but the ImageNet entry used RGB images from the ImageNet database. The ImageNet network had four dense blocks, with three intermediary transition blocks as they are called, and a global average pooling block which connects to a thousand-dimensional output, all summarized in this picture. Let us look at each of these blocks, starting with the first one, labelled TB1 here, which is the initial conv layer. The input image is 224 × 224 × 3, the standard crop used by most algorithms in the ImageNet challenge. It is subjected to a 7 × 7 convolution with a stride of two, giving rise to 64 feature maps, that is 2k, because the growth rate k = 32 for this particular architecture. The convolution is followed by batch norm and ReLU, and then a max pooling with a stride of 2, which gives rise to a 56 × 56 × 64 feature-map stack: 64 channels of size 56 × 56. This is the input to the first dense block.

Now look at the first layer right here; the growth rate again is 32 as we saw earlier. First we do batch norm followed by ReLU, which retains the size of the feature maps, 64 maps of 56 × 56; then the 1 × 1 convolution takes us to 4k feature maps, meaning 128 feature maps of the same spatial size, since these convolutions preserve the size. Once again there is a pre-activation batch norm and ReLU, followed by (there is a typo here in the slide) a 3 × 3 convolution, which gives rise to k = 32 feature maps. This is the output of one of the convolution layers in dense block one; after concatenation with its input it gives rise to 96 feature maps, because there are 32 feature maps here plus the 64 input feature maps. So much for one of the layers in the first dense block. If you look at a transition block, right here, it receives as input 256 feature maps; you can go through the math and verify that it is indeed 256. It takes these 256 maps and applies batch norm, ReLU and a 1 × 1 convolution which gives rise to 128 feature maps, then a 2 × 2 average pooling with a stride of two, reducing the spatial size and resulting in an output of 128 maps of 28 × 28. Following this through all the dense blocks, when we approach the average pooling block we have 1024 feature maps of size 7 × 7; we do global average pooling, and this is then fully connected to a thousand-dimensional activation followed by a softmax.

This is one of the typical DenseNet architectures used for the ImageNet challenge. It has performance comparable to ResNet and the other large architectures we have seen, but with a greatly reduced number of parameters: for instance, one of the top-performing DenseNet architectures had about 0.8 million parameters, 800,000, which is orders of magnitude less than some of the larger networks. This may seem counter-intuitive, because you are concatenating features to subsequent layers instead of adding them like in ResNet; however, no large set of new filters is defined in every layer. You control the number of filters per layer with the growth rate, and by using a small enough growth rate you define very few filters and consequently far fewer parameters to estimate.

So for a hundred-plus-layer network, in this case the 121-layer network we saw, the number of parameters is of the order of hundreds of thousands. This has another benefit, in that it does not overfit: typically, large networks with hundreds of millions of parameters have a tendency to overfit unless data augmentation and regularization are done, so this acts as a kind of implicit regularization. It is also referred to as feature reuse, because you are concatenating features from earlier layers and applying filter kernels on top of them; that is another advantage, and the reduced number of parameters also helps. We will later see how this architecture can be used as a fully convolutional network for semantic segmentation, in the form of encoder-decoder networks or U-Nets, and see how its performance compares to other deep architectures. Thank you.

Let us look at one of the layers inside the first dense block, right here. It receives as input the 56 × 56 × 64 feature maps from the initial conv layer, which we called TB1. There is a batch norm and ReLU layer, followed by a 1 × 1 convolution which gives rise to 4k feature maps, about 128 because k is 32 in this case; then again a batch norm and ReLU, and (there is a mistake here in the slide) a 3 × 3 convolution to produce k feature maps. These k feature maps are concatenated with the input feature maps: we saw there are 64 of those plus the 32 outputs, so we get 96 feature maps, which are given as input to the subsequent conv layer inside the dense block. As you progress through each of these layers, at the end we have one, two, three, four, five, six layers, each producing 32 feature maps, plus the 64 input feature maps, which gives 64 + 6 × 32 = 256 feature maps as the input to the transition block. The transition block then does a 1 × 1 convolution followed by 2 × 2 average pooling to give 128 feature maps. If you walk through this similarly for every other dense block, and of course the transition blocks, you end up with about 1024 feature maps of size 7 × 7.

Machine Learning for Engineering and Science Applications
Professor Dr. Ganapathy Krishnamurthi
Department of Engineering Design
Indian Institute of Technology Madras
Train Network for Image Classification

(Refer Slide Time: 0:18)

Hello and welcome back. Last week we were looking at convolutional neural networks: the operations in those networks, how the convolutions are done, and some of the more common architectures, and so on. This week we will start off with how you would code something like a CNN in MATLAB. This is based on an example script provided by MATLAB; I will walk you through some pieces of the code just to give you a head start, and also show you how to look things up and proceed in case you do not understand something.

It is fairly easy to walk through and create one for yourself. This particular script shows you how to create a CNN architecture for MNIST digit classification.

(Refer Slide Time 1:18)

The script itself is available; I will upload the script, or a link to it, so that you can look at it. So let us start. Again, this is a CNN for MNIST digit classification, and we will look at the data first.

(Refer Slide Time 1:37)

We have to load the data. MATLAB gives you something called the imageDatastore object, and you can create one. What it has, basically, is not the data itself. Say you have 10,000 images and you do not have the hardware to load all of them: it is not practical, and you may run out of memory very quickly. So what this imageDatastore does is store links to the images. It tells you where to find the images: the image directory, the filenames of the image or data files you are using, and some meta-information about the images, in this case the size of the image, the number of channels, and so forth. This piece of the code that I have highlighted shows how to create the datastore object. The first line of the code builds digitDatasetPath: if you look at it, it is fullfile with matlabroot and a sequence of folder names.

It creates a path, the path name; that is all it does. If you want to see what it contains, you can run the script section by section, so I will just run this one section. It is running on the cloud, I guess, so it is a bit slow. Right, it is done. If you look at digitDatasetPath, you just type the variable name down here at the prompt. That is it: it is basically the path name to the directory containing the digits data. You can actually open it in your file browser and look at the data if you want to. So that is what this command does: fullfile creates the path name.

The imageDatastore function creates this variable imds, which has all the meta-information about the data. What do you give it? You give it the path to the directory which has the data, and you ask it to include subfolders. Your organisation would be: you have a directory (or folder) where you keep all the data. In this case, say you have a directory called digits, and inside it you have subdirectories labelled 0, 1, 2, 3, up to 9: subdirectory 0 has all the images of the digit 0, subdirectory 1 has all the images of the digit 1, and so on. So you ask it to look at subdirectories as well, with IncludeSubfolders, and you set LabelSource to foldernames. The imageDatastore can also store the labels, so what does LabelSource foldernames mean? You name your folders so that each folder name is actually the label for the images inside. You know this is supervised learning, so for every X you need a label Y, and where do we get that label information from? You name your subfolders so that the name reflects the class of the images stored in that subfolder, and the datastore takes the label data from the folder names themselves; every subfolder has to be named intelligently for you to do that. The resulting calls look roughly like the sketch below, and then these are just objects that you can inspect.
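For reference, the datastore setup described above looks roughly like this; the demo path shown is the usual location of MATLAB's digit dataset, but treat it as an assumption and verify it on your installation:

% Build the path to the demo digit images shipped with MATLAB (assumed location)
digitDatasetPath = fullfile(matlabroot, 'toolbox', 'nnet', ...
    'nndemos', 'nndatasets', 'DigitDataset');
% Create a datastore that records file locations and folder-name labels
imds = imageDatastore(digitDatasetPath, ...
    'IncludeSubfolders', true, ...      % descend into the 0..9 subfolders
    'LabelSource', 'foldernames');      % each folder name becomes a class label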

(Refer Slide Time 5:07)

For instance, you can type imds at the prompt, and it tells you what it has: all the filenames and the labels for each one of them. You see that the imageDatastore has the following properties. It has all the filenames with their entire paths; look at this, there is a folder 0, and inside folder 0 are all these images, each one an image of the digit 0, and there will be a folder 1 and so on. Similarly, it has a label for each of those images. There are 10,000 images and therefore 10,000 labels: the display shows 3 of them plus 9,997 more. So this is what these two commands accomplish in the MATLAB environment.

(Refer Slide Time 5:59)

You need the Deep Learning Toolbox for this. Next, you just want to look at the data, to make sure you have it. The number of images is 10,000; you choose a random permutation of 20 of them and display each one. That is simple enough, so run the section and see how long it takes. It just displays a set of 20 images in a panel: subplot creates 4 rows and 5 columns of image display panels. These are small images, because each of them is of size 28 by 28, so they come up nicely in that panel. The display loop is sketched below.
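The display step described here is approximately the following (the counts match the walkthrough; the loop variable names are illustrative):

figure;
perm = randperm(10000, 20);        % pick 20 of the 10,000 images at random
for i = 1:20
    subplot(4, 5, i);              % 4 rows x 5 columns of panels
    imshow(imds.Files{perm(i)});   % display each image directly from its file
end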

(Refer Slide Time 6:40)

The next task, as usual, is to split the data into training, validation and testing sets. There is a method which does that, splitEachLabel, and you give it the imageDatastore as input, because the datastore itself carries the label information: for every image referenced in the datastore there is a corresponding label, and it knows that. We split in proportions 0.7, 0.2, 0.1, which you can see add up to 1: 70 percent of the data for training, 20 percent for validation, and 10 percent for testing.

And 'randomized' means it will select that 70 percent of the data at random: there are 10,000 images, and it will select 7,000 at random. If we leave this option out, it will just take the first 7,000 images, the next 2,000 and the next 1,000. Randomizing is better, because for training you would like a random sample of your data. If you are going through this code and some of it appears opaque, the easiest way to look a function up is to type doc splitEachLabel.

Sometimes tab completion works, but it does not in this case. If you use doc, MATLAB help takes you to that particular function or method and shows the different ways you can call it; they are all listed there. I will not go through them, but the version we saw is the one with 'randomized' in it. So you can use this to split your data. This is important, because whenever you do supervised training you need a training, validation and testing split, and this lets you do that.

The imageDatastore is a very convenient way of doing it, because you are not actually manipulating the images directly: you do not have them in memory, you just have links to the images in the form of filenames, metadata about the images, their label information, and so on, and the split operates on that. So if you do not understand any of the functions in the script, you can just do doc followed by that function. To use doc, though, you have to know the exact name of the function, method or construct you are using.

Otherwise, try help; sometimes that works too. But when you are going through a provided piece of code, it is easy to doc whatever you see, and the corresponding documentation will pop up. The split call looks roughly as follows.
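A sketch of the split described above (the output variable names are illustrative):

% 70/20/10 split into training, validation and test sets, chosen at random
[imdsTrain, imdsValidation, imdsTest] = ...
    splitEachLabel(imds, 0.7, 0.2, 0.1, 'randomized');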

(Refer Slide Time 9:31)

So that is what splitEachLabel does. If you are curious about what the result holds, you can type imdsTrain at the prompt; I have to run the section first, so run section, and it is done. It tells you there are 3 files shown plus 6,997 more, so about 7,000 data points in the training set, because we set 70 percent here; you can change this around and convince yourself that it works that way. Similarly, imdsTest lets you see that 10 percent of 10,000 images gives you 1,000 images for testing. So now we have three new datastores: imdsTrain, imdsValidation and imdsTest. The next step is to define the CNN architecture, and again this is an important step. It is very convenient, because every layer is defined by its own function or method with its own arguments; you just need to know what they are. So the image input layer is 28 by 28;

that is the size of the MNIST data, and it has one channel; you can think of it that way. Then a 2-D convolution layer with a kernel size of 5 by 5 and 20 filters; a reluLayer (you know what ReLU is); a max pooling layer, 2 by 2 max pooling with a stride of 2; and a fully connected layer with 10 neurons. This feeds into a softmax layer followed by a classification layer. What does the softmax layer do? It converts the raw numbers into probabilities; you can think of it that way.

And the classification layer actually calculates the cross-entropy cost function for the classes. If you do not know what these things do, again, like I said, use doc; for instance, doc classificationLayer.

(Refer Slide Time 11:51)

If you do that, it shows you exactly what it does: it is a layer that computes the cross-entropy loss for multiclass classification, for mutually exclusive classes, which is what we want, since we are doing digit classification, 0 to 9, so there are 10 classes. It infers its size from the output of the previous layer: we have a fully connected layer of 10, which is then converted into a softmax.

So it knows that there are 10 softmax outputs corresponding to the 10 classes, and it calculates the cross-entropy cost function on them; that is what this layer does. Similarly, doc convolution2dLayer tells you what that function does and what the inputs can be: you see filter size and number of filters, and if you come back to the script, the filter size is 5 and the number of filters is 20. You can do the same thing for reluLayer, maxPooling2dLayer, and so on. Now run this particular section of the code with run section, and it is done.

So then you can look at the layers that you have created. Again, this basically lists the
architecture, which you have already seen, right? Image input layer, convolution, then the ReLU;
so the ReLU is actually treated as a layer. Typically you would treat convolution and ReLU as
one, but in most programming environments they are treated as separate layers. And we have max
pooling, fully connected layer, softmax, classification output, right? So this outputs the
probabilities.

(Refer Slide Time 13:52)

So now if we go down, we have to set the parameters for training that network. Again, these are
passed in the form of another variable called options, which you create by calling the function
trainingOptions. We are using stochastic gradient descent, and the maximum number of epochs we
want to run is 20; in fact, 1 epoch is when you have done backprop using all of your data once.
You have the initial learning rate. Verbose is false, meaning that it will not print too much
output.

For the validation data you give it the validation datastore. Plots set to training progress
means that the training progress plots will show up, okay; we will run it and see what comes up.
Again, if you want to explore the different options, do doc trainingOptions and it will show
what can be done. Okay. I will run it; I did not run this one, so run section. That works, okay.
So now we are all set: we want to train the network, and there is a function for that,
trainNetwork.

It takes the training data as input, right? Then the layers, which is the architecture, and the
options for your optimisation. That is what it does. So we will train it; let us see how fast
this runs, I am not sure. This is not exactly a fast computer, it runs on the cloud. So we will
see what pops up.

(Refer Slide Time 15:14)

So this window will pop up, right? You can see the number of iterations; it will stop at 1080.
The number of epochs shows up here, right? Epoch 6; we are doing 20 epochs if you remember. This
is the classification accuracy versus the number of epochs. This is the loss, the cross entropy
loss that the classification layer computes. And we can see 2 different curves, right? There is
the smoothed training loss and there is a validation loss, and you see that they are actually on
top of each other.

Similarly here, the validation accuracy and the training accuracy; you can see that there too.
It is running quite fast; we should be done. The 20th epoch, that is the final one, ya. So it is
pretty good: we are close to 99%, in fact more than 99%, as you can see. It used the CPU, okay,
and it works very well on the CPU, so that is fine; the hardware information comes up here. Now
that we are done, we can close this window, because it just pops up to tell you how the training
progresses, okay. So it is very nice information that you can get.

(Refer Slide Time 16:30)

So once training is done, net has the trained weights. Now you want to see the performance on
the testing data. So if you have a network, how do you do the forward pass on new data? Right,
that is this function, classify. It takes all the testing data, runs it through the network and
gives you the predicted output, and YTest basically gives you the labels for the data in the
test datastore, right? So this is the ground truth and this is your prediction. Okay. So let us
just run this one, run section; this should be fairly straightforward. So it is done. Hopefully
yes.

(Refer Slide Time 17:17)

Then you can look at the accuracy, okay, and you can look at the confusion matrix, both of them
using these 3 lines of code. So we will just run that and check whether this works out fine.
Okay, so the accuracy: we take the sum of YPred == YTest. This double equals is a logical test,
not an assignment; it returns 1 for every prediction that is equal to your ground truth. So you
add them up, and the number of correct predictions divided by the total number of test data
points gives you the accuracy.

The confusion chart, again, is a function that takes the test labels and the predictions and
gives you the confusion matrix, okay: the predicted class against the true class. Anything off
the main diagonal is a false prediction, okay; that is what we are looking at in the confusion
matrix. Ideally a confusion matrix would be purely diagonal; typically it is not, so we have
some errors in our predictions, but this is a pretty good performance.

Okay. So this is just a very simple example of how we can create a CNN for MNIST classification;
these are small images. You can adopt the same strategy elsewhere; every piece of code here is
important in order to make the network. To create the architecture you just create this layers
variable, and MATLAB supports every conceivable kind of layer, okay. So if you want to create a
deep network, you just have to put in those commands there and MATLAB does the rest for you.

And you should also be able to configure your optimisation with the right algorithm and the
right choices; if you are wondering what the choices are, just do doc trainingOptions and it
will show you what is available, and you can choose the corresponding algorithm. We looked at
some of the more commonly used optimisation algorithms, and most of them are enabled in MATLAB,
so it should be fine. Okay. So this is, like I said, a simple example. In the later weeks there
will be one application week; maybe we will look at a slightly more complicated architecture and
see what it is like.

We may try to code them in MATLAB, but demonstration will be difficult because of the size of
the data. Another point to note is how the data was organised, okay. There is a root, a main
directory, which has the data stored in subdirectories; each subdirectory is named according to
its class, and inside each subdirectory is the actual data, okay. When you are trying to solve a
new problem, that is up to you; this is a canned problem where MATLAB has done all the work for
you.

But if you have some new data and you are trying to, let us say, run a CNN, or for that matter
any network, on that data, to train the network using that data, then it is your responsibility
to create these folders. Okay. You have to curate the data yourself; that is an essential part
of, you know, coding deep learning algorithms. Make sure that the data is clean and nicely
organised, because once you have it organised in this fashion, you can exploit the existing
methods and functions that are in MATLAB, right?

Nothing prevents you from writing your own datastore, writing your own code to load one image or
hundreds of images at a time and training your network with it. You can do that too, but it is
very convenient if you use the MATLAB datastore, and for that you have to organise your data in
this fashion, right? Make sure that your subfolders or subdirectories are named according to the
classes that you have for supervised training, and then make sure that all the subfolders are in
one fixed place from which MATLAB can load the data sets. Right?

And of course you have to curate them, separate the data and all that. That is a lot of hard
work; most of the time, that takes up a lot of your time, okay. So this is how we start off this
week. We will be looking at transfer learning and hyperparameter optimization, and we will also
look at one application of how CNNs can be used for medical image analysis, brain tumour
segmentation specifically. Okay. So thank you.

Machine Learning for Engineering and Science Applications
Professor Dr. Ganapathy Krishnamurthi
Department of Engineering Design
Indian Institute of Technology Madras
Semantic Segmentation

Hello and welcome back. In this series of lectures, we will look at semantic segmentation and
the fully convolutional networks that are used for accomplishing this task.

(Refer Slide Time 0:25)

So let us consider this image here; it is an image of a building. If you consider the neural
networks that we have looked at so far, especially those that do well in image recognition
challenges, and if a building is one of the classes, what such a network will typically do is
just identify this image as a building, or for that matter a lawn or, let us say, a skyline in
the distance. However, in many tasks what is required is actually a pixel-wise labelling of the
scene.

On the right here is the result of a semantic segmentation network. It has classified each of
the pixels into a corresponding category. For instance, all the pixels in this region have been
classified as a building, the skyline is appropriately classified as sky, we have the vegetation
here and here, and the fencing has also been classified accurately. Of course, there are some
errors where it mistakes the shrubs for the fencing, but that is to be expected in some
networks.

So this is the typical task that we would like to accomplish in semantic segmentation: classify
each pixel as belonging to a particular category. Then of course there are many nuances here.
For instance, here there is just one building, but suppose there are a series of buildings
separated by some space; in some applications it is actually desirable to label each of those
buildings and also recognise them as distinct individual buildings. That is called instance
segmentation, a slightly more fine-grained task than what we are going to look at right now.

The networks we are going to look at only segment objects of a particular type. So if there are
multiple buildings, they will all be designated as buildings; if there are multiple objects of
the same type, they will just be designated as that same object type, okay. So just to
reiterate, we want to classify every pixel as belonging to a particular object category, and
that should be the output of the network. Okay.

(Refer Slide Time 2:31)

So how would you accomplish this with, let us say, some of the networks that we have seen so
far, the regular conventional CNNs that we have looked at in these lectures? For instance, take
this input, the same scene of a building surrounded by a lawn. What we typically do is take
patches of a certain size, let us say 64 × 64 patches, and give them as input to the
convolutional network, to classify each patch as belonging to one of N classes. For instance, we
might train a neural network to identify sky, vegetation, buildings and general objects or
persons, and we train the network as such.

For instance, the ImageNet challenge has 1000 categories; we can use a similar network, trained
similarly, to identify, let us say, 4 or 5 categories depending on the kind of scenes you are
looking at. For images of parking lots, say, we might just want to classify driveways, cars and
maybe signs. Okay. So the idea is to take patches from the image, pass each one through the
network, let us say a typical CNN, and output the probability or score for a particular class.
That way we accumulate labels across the image.

This score is for the central pixel of that patch; that is what is shown here. We would slide
the patch, centring it around every pixel in the image, and classify that particular pixel as
belonging to a particular category. As you can see, this can lead to a lot of computational
overhead, because typical images will be of the order of 200 × 200 to 500 × 500 pixels or more,
up to a million pixels, so we would have to do on the order of a million forward passes through
the network in order to label every one of the pixels. Of course, you can subsample the image,
but that will lead to errors in the semantic segmentation.

So the issues, as stated earlier: it is very inefficient in terms of computation, and there is
no feature sharing among the overlapping patches, in the sense that the network only looks at
that particular patch and does not pay attention to what is around it. To make it more effective
you end up taking larger and larger patches, which of course means doing more computation, and
the results take a longer time to be output.

(Refer Slide Time 4:57)

So what would we ideally want? That is why we are now slowly moving towards fully convolutional
networks; we will see what that is in a few moments. What would we ideally want? We want an
input image, let us say of size H × W with 3 channels, a typical RGB image, and we would like to
send the image through a series of convolutions and have an output layer with as many feature
maps as there are object classes we want to detect, where each feature map holds scores giving
the probability that a particular pixel belongs to that particular object class.

In the end, we take the argmax across the feature maps and assign a label to every pixel in the
input image. This is what we would like to accomplish because it is much more efficient. Another
advantage is that if you make the network convolutional, like we see here (we have left out max
pooling layers and so on), then we can give inputs of any size, okay; that is another advantage.
We are not restricted to a fixed-size image. If you look at the method we explored in the
previous slides, we had a CNN which takes a patch of a particular size as input and classifies
the centre pixel; here there will be no such restriction. That is what we want, and this is the
aim of the fully convolutional network.

(Refer Slide Time 6:27)

Now let us go into a little more detail as to how these things work, okay. We have an input
image, and a typical classification network is what we want to use for semantic segmentation. We
would take patches from the input image and feed them to the CNN; again, this is AlexNet shown
here, with alternating convolutional and max pooling layers. Now look at this layer here, the
3 × 3 pooling layer followed by a fully connected layer (FC stands for fully connected). What
happens is that after this max pooling, the size of the volume would be 256 × 7 × 7 .

That will be the size of the volume after max pooling. Now this is fully connected to 4096
neurons. In order to do that, we have to rasterize it: the 256 × 7 × 7 volume is rasterized into
one vector, which is fully connected to the next layer with 4096 outputs. This is what prevents
us from having an arbitrarily sized input, okay, because if we have, let us say, a slightly
larger input, this will be difficult to do since the number of weights here is fixed. Right?

So how many weights are there? We calculate 256 × 7 × 7 × 4096, roughly 51 million; that is the
size of the weight matrix, the number of elements in it. Right? Now if we have, let us say, a
larger image, say 400 × 400 , then this feature map size would increase; this number would be
much larger. The number of channels would be the same, but the size of each feature map would
increase, and then it is not possible to propagate beyond this point because the number of
weights is fixed here. Okay.

So a restriction that comes with all the networks we have looked at so far is that the size of
the input has to be fixed; we cannot use arbitrary input sizes. In order to have a flexible
semantic segmentation framework, we would like to have a network which does not have this
limitation.

(Refer Slide Time 8:52)

So what we do is convert these fully connected layers into what are called fully convolutional
layers. How do we go about doing that? Let us take this typical block of input coming into a
fully connected network, right? There are 7 × 7 feature maps, about 512 of them; we rasterize
this and fully connect it to 4096 neurons, which in turn are connected to another 4096 neurons.
Then, if this was like AlexNet, this will be connected to 1000 output classes: a 1000-way
softmax.

So how do we convert this into a fully convolutional network? To accomplish that, we just define
filter kernels that cover the entire input region. Right? So, in order to convert this
particular layer into a fully convolutional layer, what we do is define 7 × 7 filters, where of
course the depth of each filter is 512, and we define 4096 of them. Okay. This will lead to 4096
feature maps of size 1 × 1 . Then, in turn, we again define 4096 filters of size 1 × 1 whose
depth is again 4096, and we get 4096 outputs. Okay. So this is how one would convert a fully
connected network into a fully convolutional one.

That is, just before the fully connected layers, you define filters which cover the entire input
region, and then the fully connected layers themselves we convert into 1 × 1 convolutional
layers. So now the network is fully convolutional. What advantage does that produce? Let us say
for this particular network an input comes in such that this particular layer gets a larger
input, say 22 × 22 with 512 channels, because your input image size is different.

You can see that since we have defined 7 × 7 filters, we get a 16 × 16 output; again the number
of channels is fixed at 4096. And again, by doing 1 × 1 convolutions, you get 16 × 16 output
feature maps, okay. Now we have to see how we can use this for semantic segmentation: what do
these feature maps represent, do they in fact represent anything at all, and how do we get to
the point where we can do semantic segmentation?
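In the layer syntax used earlier, the conversion described here would look roughly like this
sketch (the sizes come from the slide; the depth of each kernel is inferred from the incoming
volume):

    fcAsConv = [
        convolution2dLayer(7, 4096)    % 7-by-7 kernels (depth 512) replace the first FC layer
        reluLayer
        convolution2dLayer(1, 4096)    % 1-by-1 convolutions replace the second FC layer
        reluLayer
        convolution2dLayer(1, 1000)];  % 1-by-1 convolutions give 1000 score maps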

(Refer Slide Time 12:10)

So let us see the difference between image recognition and, after fully convolutionalization,
what the output means. Okay. For the image recognition network shown right here, you have an
image as input, then a sequence of convolutional layers, and these are the fully connected ones
right here: 4096, 4096 and 1000 output classes. So you would get 1000 output probabilities or
scores, and you would assign the label corresponding to the highest probability.

In this case, the highest probability corresponds to cat, so you will call it a cat. Now let us
say we have a fully convolutionalized neural network like we saw earlier. Instead of going from
the 256 channels to the fully connected layer, we define filters which span the area of the
input volume. Okay. Once you have done that, and we have a slightly larger image as input, as
shown here, then we get 4096 feature maps of a given size, and on these we can again do a
convolutional layer to give rise to 1000 output maps.

Let us look at one particular map there. These are feature maps, 4096 of them, of a particular
size depending on your convolutional kernel, and then we have these 1000 output scores.
Previously, in the conventional CNN, these were just a single vector of size 1000, but in this
case there will be 1000 of what you might call activation maps or score maps. Each of these 1000
outputs is itself like a feature map, a map which contains the probabilities assigned to every
pixel.

If we take a slice of this 1000-channel output and look at it, as shown here, it will be like an
image wherein the values give the probability of every pixel being a cat. Okay. So we are
converting from a fully connected image recognition network to a fully convolutional network,
and the output scores themselves are like images, with the pixel values of those images being
the probability of that particular class.

So the N-th map will correspond to cat, and if you pull that out and inspect the values, each
value would correspond to the probability of that pixel being a cat, okay. Now notice that, for
a given input image, this output map is smaller than the input image: we have a slightly
coarser, subsampled version of your probability scores. Okay. So this is the problem that we
will address.

We know that because of the sequence of subsampling layers present in a typical network, the
output score map will not be the same size as the input image; it will be a subsampled version,
so the probabilities that are produced are also coarse. Of course, the most logical thing to do
would be to upsample this by interpolation: you can upsample it to get the same size as the
input, with every pixel showing the probability of the class.

(Refer Slide Time 16:18)

So how do we actually accomplish that upsampling? That is the question, right? Once again we
have this typical image recognition network, and by turning the fully connected layers into
convolutional layers we obtain 1000 score maps; they were dealing with ImageNet data, so it is
fine to consider 1000 categories. So, 1000 score maps, but these maps are coarse. What I mean by
coarse is that their size is much smaller than the size of the input image, right?

What we want to do, ideally, is upsample this to get to the input size. The simplest way, as I
mentioned earlier, is to do bilinear interpolation; you do not have to learn it, though in fact
you can have a learnable interpolation. But that alone is not very effective, because in the
process of downsampling throughout this network you have lost some information. So in one of the
earliest papers to do fully convolutional networks for semantic segmentation, what they did was
to pick out feature maps from the intermediate layers and add them to the output.

How is that accomplished? For instance, consider the feature maps after this 3 × 3 pooling,
okay; they are of a particular resolution, and you can work out the exact numbers on your own,
this is just for the sake of illustration. We can also pull out feature maps from this
particular layer, at another resolution, right? And then we have the output from here, which
again is of some particular resolution. So the idea is to upsample each of them.

The idea is to upsample each of them and add them, okay. What that accomplishes is that the
resolution we have lost by going deep into the network, we are trying to reclaim. But then we
will have multiple feature maps in the output of a 3 × 3 pooling operation, so the way to handle
that is to do a 1 × 1 convolution to reduce them to the required number of score maps, and then
upsample all of them and add. This upsampling can be learnt. Okay.

This is what we saw as transpose convolutions in earlier lectures; that is what enables the
learnt upsampling. So we upsample feature maps from the earlier layers following a max pooling
and add them to the output. Note that the 1000 output score maps are actually quite coarse, so
we still have to upsample them a bit in order to match the resolutions of the feature maps from
the earlier layers. But we add them, produce one set of scores, and use a pixel-wise cost
function.

Your loss function would be a summation over the pixels in the output score maps and a summation
over the number of samples in a mini-batch, and then of course backpropagation is the same as
earlier. Since we are using transpose convolutions, the operation is just like convolution
reversed, and the backpropagation algorithm still works seamlessly.
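Written out explicitly, one common form of this pixel-wise cross entropy loss (a sketch; the
exact normalisation varies between papers) is

    L = − (1/N) Σ_n Σ_(i,j) Σ_c y(n, i, j, c) · log p(n, i, j, c)

where N is the number of samples in the mini-batch, (i, j) runs over the pixels of the output
score maps, c runs over the classes, y is the one-hot ground-truth label and p is the predicted
probability at that pixel.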

(Refer Slide Time 20:49)

So that was one of the earlier techniques that was used. Following that, another technique was
proposed which is now very popular; it is very similar to what we saw earlier, pulling feature
maps from the earlier layers through what we can call skip connections or shortcut connections,
okay, and that too seems to work. However, in the recent past, the most common architecture is
the encoder-decoder network. How this works is basically that we have an encoder, a series of
convolutions and max pooling layers, and after we reach a specific size of feature map, we
follow it with a decoder, using again a series of convolutions and max unpooling, as it is
called.

If we do that, the output score maps will be the same size as the input. Okay. The idea behind
the encoder is to extract the most relevant features from your input, and the decoder uses those
to produce a pixel-wise labelling of your input. This encoder-decoder structure is what is
typically used for semantic segmentation. An advantage here, as you can see, is that we have
these cross connections (we will look at what they are) which take input from the earlier layers
in the network and pass them on to the corresponding layers in the decoder.

You can see that there is a symmetry between the encoder and decoder: we have a similar sequence
of convolution and pooling operations on the decoder side as on the encoder side, and you can
say that the decoder side is the transpose of the encoder. Okay. This particular network uses a
technique called max unpooling, which eliminates the need for learning to upsample. In the
previous slides, the technique we looked at had to learn to upsample the final output score maps
as well as the feature maps of the earlier layers in the network so as to match the input; there
was some learning required, and here they get around it by doing something called max unpooling.

(Refer Slide Time 23:24)

So how does max unpooling work? Let us say this is a feature map, just a toy example, of size
4 × 4 . If we do 2 × 2 max pooling with stride 2, then in every block here the maximum is taken:
the red block here gives you 9, the green block here gives you 8, here it is 15 and 10. Those
values are recorded, and this is your pooled feature map. However, while doing that, we also
keep track of the indices of the maxima within each 2 × 2 sub-block. So, for instance, 9 sits at
index (1, 1) within its sub-block, right?

For instance, this one is at (1, 1) in its 2 × 2 block, this one is at (0, 0) in its block, 10
is at (0, 1) in its block, and 15 is at (1, 1) in its block. Okay. So we keep track of these
indices. Now, where is the unpooling layer? The unpooling layer is on the decoder side, where
the feature maps are of the corresponding size. When you come to the unpooling layer, you create
the larger feature map array and assign the values to the recorded locations. Okay.

For instance, you will have a sequence of convolutions on the decoder side, and following the
convolutions you will have a max unpooling. You take the output of the convolution and assign it
to a larger array at the locations corresponding to the indices shown here. Okay. So by keeping
track of the max pooling indices, you can construct this larger matrix by assigning the values.
Here, just for the sake of illustration, I have taken the values from a 4 × 4 and assigned them
to another 4 × 4 ; in practice we do something similar, but we have to keep track of the
indices, okay.

Now, once the max unpooling is done, in this case by assigning the values to the larger matrix
as indicated, we can proceed to do convolutions. The convolutions are done as in a usual
convolutional neural network, and of course the network learns to fill in the activations
appropriately.
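In MATLAB's layer syntax, this pooling/unpooling pairing can be sketched as follows (a minimal
illustration, assuming the Deep Learning Toolbox layers; in a full network the pooling layer's
index outputs are wired to the unpooling layer's inputs with connectLayers):

    % Encoder-side pooling that also emits the max indices and the output size
    pool = maxPooling2dLayer(2, 'Stride', 2, ...
        'HasUnpoolingOutputs', true, 'Name', 'pool1');
    % Decoder-side unpooling that writes values back at the recorded indices
    unpool = maxUnpooling2dLayer('Name', 'unpool1');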

(Refer Slide Time 26:18)

Another architecture which has found wide application in biomedical image segmentation tasks is
called the U-Net. This network is again very similar: you have the encoder and decoder sides
here, and we have these skip connections between the encoder and decoder. This particular
network does use transpose convolutions, so it does learn the upsampling. But it goes deeper
than most networks, and because of the skip connections the outputs are generally of much higher
quality.

Let us start with a typical input of size 512 × 512 ; here I have shown a cardiac MRI slice of
size 512 × 512 . A succession of 2 convolutions produces 64 feature maps, with padded
convolutions so that the size is retained. Throughout this network, similar to VGG, we do 3 × 3
convolutions, okay. The red indicates max pooling, so your 512 × 512 will become 256 × 256 , the
in-plane size, that is.

Once again you do a sequence of 3 × 3 convolutions, similar to VGG, to produce two layers of 128
feature maps; in the initial layers you have 64, now we have 128, followed by max pooling, okay.
Then again 3 × 3 convolutions to produce 256 feature maps, followed by a max pooling layer which
gives you feature maps of size 64. And then once again two successive convolutions to produce
512 feature maps, followed by a max pooling layer.

You can go across this network like that; I have just described one side, the encoder side. On
the decoder side you exactly mirror these operations, but instead of the max pooling you do
upsampling: where it is max pooling on the encoder side, on the decoder side you do upsampling.
The convolutions are again the same 3 × 3 convolutions, and you systematically reduce the number
of feature maps as you go towards the output side.

Now, in between, as shown here, these grey lines indicate that you take feature maps from the
encoder side and concatenate them to the decoder side. Here there are multiple options: you can
concatenate, as shown here, or you can do ResNet-like processing where you just add them; that
is also possible. And if the number of feature maps becomes difficult to handle, you can do a
1 × 1 convolution to reduce them before adding or concatenating to the decoder side.

As you go up the decoder, notice that you take feature maps at the same resolution from the
encoder side and add them to the decoder side. This is very similar to what we saw with the
first fully convolutional network, where we took feature maps from the earlier layers and added
them to the output; here it is done more systematically.

Again, at every layer on the decoder side, you go and find a corresponding feature map of the
same size and resolution and bring it across; you can do a ResNet-like residual connection or
you can just concatenate, and 1 × 1 convolutions can also be used to shrink the number of
feature maps so that you do not have a feature map explosion on the decoder side. And finally,
you produce an output with the requisite number of classes.

You do convolutions so that, for instance, if you are looking at 5 classes, the output will have
5 channels, with each channel corresponding to a particular category and the pixel values in
that channel representing the probability of that particular class at that pixel. This is a very
often used network for biomedical segmentation tasks. We will look at an application where it
can be used later on, in the next few weeks.
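As a minimal sketch, newer MATLAB releases provide a ready-made constructor for exactly this
architecture in the Computer Vision Toolbox (the image size here is the one from the slide;
numClasses is an assumed value):

    imageSize  = [512 512 1];            % e.g., one cardiac MRI slice
    numClasses = 5;                      % assumed number of label categories
    lgraph = unetLayers(imageSize, numClasses);
    % analyzeNetwork(lgraph)             % uncomment to inspect the encoder-decoder structure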

(Refer Slide Time 31:06)

Of course, we have also seen DenseNet, and we can do similar processing with DenseNets. Instead
of having these VGG convolution blocks, all you have to do is replace the convolution blocks
with dense blocks, and instead of max pooling we have the transition blocks we saw earlier when
we were discussing DenseNet. We use the transition-down blocks to bring down the resolution on
the encoder side, and similarly we have transition-up blocks on the decoder side to bring the
resolution back up.

This again does deconvolutions, or transpose convolutions as they are called; deconvolution is
apparently the wrong nomenclature, because deconvolution is already well defined in electrical
engineering, so transpose convolution is the more appropriate terminology. The only difference
between this network and the U-Net we saw earlier is that the VGG-type convolution blocks are
replaced by dense blocks. So in this encoder-decoder architecture, the idea is that we can use
any of the network topologies that we saw being used for the ImageNet challenge; the only thing
we have to do is formulate the appropriate architecture on the decoder side.

(Refer Slide Time 32:33)

For instance, just to illustrate again: if you go back to the U-Net architecture, you see that
this particular whole thing is nothing but a VGG-type block. Okay. We have two 64s, two 128s
here, then two 256s and then 512s and so on; of course, VGG had three 512 blocks. But we can
have a VGG or AlexNet on the encoder side, and then we just have to flip that on the decoder
side. Of course, there is no guarantee that every network topology will work; this particular
architecture actually works, we know that for sure, as these have been published and applied to
many image processing tasks.

In the next series of lectures we will consider transfer learning. We will see how we can use
pretrained networks and apply them to related tasks, and how that helps in the case where there
is not enough data. Following that, we will also look at hyperparameter optimisation.

Machine Learning for Engineering and Science Applications
Doctor Ganapathy Krishnamurthy
Department of Engineering Design
Indian Institute of Technology Madras
Hyper parameter Optimization

(Refer Slide Time: 0:16)

In this video we will look at hyper parameter optimization which is one of the important
aspects of training a deep neural network.

(Refer Slide Time: 0:25)

In general, the performance of the network depends on how well you train it. One aspect of the
training is to fix the various hyperparameters involved in constructing a deep neural network,
okay. This would involve parameters like the number of hidden layers and the number of units per
layer, or the number of filters per layer if you want to think of it that way.

What kind of activation functions to use? If you are using some kind of dropout regularization,
how much do you use? What kind of regularizer to use, L1 or L2, and what is the strength of the
regularizer? What learning rate would you typically use? The weight decay? And of course
initialization; we will look at this as a separate topic. We will not touch it here, but we will
look at how to do weight initialization later on, in a later video.

A wrong setting, or I would say more like a non-optimal setting, can actually affect the
performance of a network, okay, in many ways: in the sense that it may converge to a point where
it is not giving you a very high classification accuracy, or the loss function does not decrease
beyond a certain point.

Another case is very slow convergence if you do not have optimal values. This part of network
optimization is what makes the training of a network very hard, okay. If you are working with a
much simpler network, let us say 2 to 3 layers, it is in general easier to optimize and you can
do that by hand, per se.

But as the number of layers becomes large and the size of the dataset grows, it becomes in
general very difficult to determine exactly what to choose for these hyperparameters.

(Refer Slide Time: 2:17)

Typically there are 2 methods that are recommended, though they are more like heuristics than
methods. One of them is called grid search. The idea is to train the network with each possible
combination of hyperparameter values, okay. Needless to say, as your number of parameters
increases, the computation becomes much higher.

Remember that if you have a slightly deeper or larger network, you actually have to train the
network for every choice of these parameters, okay. And of course you would have a held-out
dataset, called the validation dataset, which you would use to see which parameter setting gives
you the best results for the task the neural network was set to accomplish.

Now, grid search does this hyperparameter search in a very systematic way. In this case, if the
columns are hyperparameter 2, then for a fixed value of hyperparameter 1 you would change
hyperparameter 2, so that it is the only variable; that is one way of doing it, and then you
would fix that and change hyperparameter 1.

One way is a greedy search paradigm: you randomly initialize all the parameters that you are
trying to optimize, vary one of them systematically, see which value gives the best validation
result, take that value of that hyperparameter and keep it fixed, then go to the next one and
change that, and so on and so forth, okay. This kind of greedy search is possible in a grid
search scenario.

However, as the number of parameters increases, the number of times you have to run this
experiment goes up exponentially, very quickly. It is very easy to see here, right? You have 2
parameters with 6 values for each of them, so you have to do 36 runs, and if you have 5
parameters, each with 6 values, then that is 6 raised to 5 runs, right?

So the number of operations increases very fast. Now, another thing to look at here is the scale
of the grid. If you have a certain range for your parameters, note that most of the parameters,
except for the number of layers, the number of activations per layer or the number of filter
kernels, are continuous.

For instance, learning rates or a weight decay parameter have to be discretized in a, let us
say, meaningful fashion. If you are looking at learning rates, just for the sake of argument,
you could set the learning rate to 0.01 and then to 0.015 and so on, but as you can see, these
are of the same order of magnitude, on the same scale.

One trick is to space all of these on a log scale, okay; that is one way of setting up the grid.
This grid search method typically works very well for smaller networks, where you do not have as
many parameters to optimize and it is also cheaper to do these computations in a brute-force
manner, okay. As your search space increases exponentially, because you have more and more
hyperparameters to optimize, the better method is random search.

Here, instead of training on all possible configurations, you just choose configurations at
random: you can make a similar grid and then sample different areas of it, okay, just to get an
idea of what kind of results you get for each of the points you have chosen. Now consider these
5 points, let us say, that I have picked.

Let us say this one gives you the best result, right? What you can then do is search much closer
to this point, to see if a better option is possible. One way the computational savings happen
is that, if you recall the grid search argument, you have to hold all the other parameters the
same: if you have 6 parameters, you keep 5 of them constant and change the 6th step by step
through the grid discretization.

That means you are repeating calculations for the same parameter value numerous times; that does
not happen in a random search. However, it is important that you sample your search space
intelligently, so that you are able to cover most of the space in a reasonable fashion. Another
thing to look out for is when your best result corresponds to something at the edge.

If the best result sits at the edge of your hyperparameter grid, it is better to extend the grid
to the left, or, if it is over here, to the right, and search a little more, because it is
possible that better settings of the hyperparameters are available beyond that value, okay. So
the first important takeaway is to discretize your hyperparameter values on a log scale.

The second is to construct a grid, which can be made based on what values you would expect to be
meaningful in the context of the problem, or you can do a literature survey and see what typical
values are used in network setups that have shown good performance; the ImageNet architectures
are pretty good pointers that way, and most of them do a random search in order to get to a
point where they can initialize the hyperparameters correctly.

(Refer Slide Time 9:11)

There is another way, called hand tuning, which is not highly recommended; it is again very
similar to grid search, but you do not do it in a very systematic fashion. You just do this to
see if there is a trend, okay. For example, if you have an MLP with 50 neurons in all the layers
and you get an accuracy of 65 percent, and then you double the number and get an accuracy of 82
percent, that is a positive trend.

Then maybe you can be a little more adventurous, train with 200 neurons and get an accuracy of
84 percent; that looks like diminishing returns. So then maybe you see that it does not make
sense to go beyond, let us say, 100 or 150 neurons, okay. You can use this kind of strategy to
figure out the limits of your grid, okay: see where you can start from so that you get a
reasonable result, and then make a grid there and do either a random or a systematic grid
search.

There is one more advanced technique, which we will not go into in detail, that makes use of a
machine learning framework: you have a learning program or learning strategy to figure out the
optimal hyperparameters. Here people typically use Gaussian processes to figure out, over the
space of hyperparameters, what the accuracy on the validation data looks like.

That is the function you are trying to fit: the performance on the validation data as a function
of the hyperparameters. If you estimate that function using Gaussian processes, you can find the
optimal values of the hyperparameters which give you the best possible result on the validation
data.

Machine Learning for Engineering and Science Applications
Professor Dr. Ganapathy Krishnamurthi
Department of Engineering Design
Indian Institute of Technology Madras
Transfer Learning

In this video we will look at transfer learning, one of the strategies used for training deep
neural networks when the number of data points available for training on the particular task
that you are interested in is very low.

(Refer Slide Time: 0:31)

So, the idea here is to use what we can refer to as a pre-trained model. Okay, so let us say you
have some data for a particular task, okay, and it is not the ImageNet challenge. For instance,
you are trying to determine the species of birds, or you have a limited amount of data to
determine the make and model of a car from pictures. This is some hypothetical task, but you do
not have enough data; maybe you have thousands of data points, but that is not sufficient for
training deep neural networks.

So, what you do is take a network like AlexNet or VGG or Inception which has been trained on the
ImageNet database, do a forward pass of your data through this pre-trained network, and store
the embedding. In the case of AlexNet or VGG you have about 4096 neurons in the layer just
before the classification layer; you take this as a representation of your data. So your input
data is now represented by a 4096-length feature vector. That is the feature vector that
represents the data.

And you use this as input to another machine learning framework, let us say an SVM (support
vector machine), decision trees, or maybe just another neural network that you can train with
this data to classify appropriately. Okay, so this is a strategy that works very well: taking
the embedding, in this case the 4096-D vector that you get from your CNN. This strategy works
very well if the data or the task at hand is similar to something that ImageNet accomplishes; in
this case, since we are talking about ImageNet, we will take that as the standard.

So, networks trained on ImageNet data are pretty much classification or image recognition
networks trained on thousands of types of images, okay, and you can safely assume that they have
learned all kinds of possible features corresponding to the wide variety of images present in
the ImageNet database.

Now, if your input data is very similar, as in the examples I quoted, the make and model of a
car, or, let us say, the species of birds, or the species of cats or dogs, okay, that data is
kind of similar to ImageNet data. So if you have a network trained on ImageNet data, you can use
this 4096-D representation as a reduced representation of your input data, and then use another
machine learning paradigm trained on it in order to accomplish your task.

However, in some cases the data might not be exactly of the same kind, or maybe you have a
slightly larger dataset. In that case, what you can do is take the network with pre-trained
weights, okay, modify the classification layer from 1000 neurons to the number of classes that
your new dataset and task demand, and then train the network. This portion of training is
basically training the network with whatever data you have for your task, and this process of
training with your own task's data is referred to as fine tuning, okay.

And in general this is what people refer to as transfer learning, okay. We start with a network
that has been trained on a very large database, where more often than not that image database,
or the task it was trained for, is very similar or at least related to the task that you are now
proposing to do with your current deep CNN. Then you use your limited dataset to modify the
weights in your network appropriately using backprop. This typically works very well in many
cases.
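A fine-tuning sketch under these assumptions (AlexNet's last three layers are the 1000-way fully
connected, softmax, and classification layers; numClasses, imdsTrain and options are
placeholders for your own task):

    net = alexnet;
    layers = [
        net.Layers(1:end-3)                  % keep everything up to the fc7 activations
        fullyConnectedLayer(numClasses, ...
            'WeightLearnRateFactor', 10, ... % let the new layer learn faster
            'BiasLearnRateFactor', 10)
        softmaxLayer
        classificationLayer];
    netTuned = trainNetwork(imdsTrain, layers, options);   % fine tune on your own data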

(Refer Slide Time: 5:25)

The reason it works, as I stated earlier, is that if you have a fairly large labelled database
like the ImageNet database, okay (considerable effort has gone into making it so), and you have
a network trained on it, then a couple of things hold. One is that, given the size of the
database and the depth of the network trained on it, it is safe to assume that the network has
learned all kinds of low-level features which are transferable to other tasks.

Here, for some CNN trained on ImageNet, the deeper layers learn task-specific features, for
instance features specific to a certain breed of dog or cat. These are easier to train, as they
are closer to the classification layer, so errors backpropagate to them faster. The initial
layers learn more generic low-level features like edges, blobs and other kinds of patterns in
the picture, and they are generally difficult to train because it is hard to backprop the errors
from the output layer all the way to the innermost layers.

So, if you have a network which is trained on a fairly large database, on a task which is very
similar or at least related to what you are going to do, then it is safe to assume that at least
the lower-level features learned by that network are transferable to your task, okay. So we just
try to exploit the features learned by this already-trained, pre-trained network for your task.

And since the deeper layers are easier to train than the innermost, earlier layers in the
network, fine tuning your pre-trained network with the data that you have will hopefully modify
the outermost layers and adapt the network to the task at hand.

(Refer Slide Time: 7:43)

In general, when you try to do this transfer learning, one strategy is to freeze the initial
layers, because, as I mentioned earlier, we would expect that the low-level features are also
transferable to your data. In some cases you are using the pre-trained network simply because
you can say that the weights have been initialized to pretty good values after training with
that large database.

But then your task or your input data may be totally unrelated to what the network was trained
on. As an example of this second case, let us say you are trying to accomplish a radiology task;
we will look at a case study later on. You are trying to classify chest X-ray scans, right, as
being abnormal or normal, okay.

Now this chest X-ray scan data is not similar to ImageNet data at all; it is quite drastically
different. But you would still like to exploit the deep network that has been initialized with
pre-trained weights. In this case you would necessarily have to train all the layers, okay, and
the assumption is that you have enough data to do so. This is another mode of transfer learning:
even though you used the pre-trained network only as a good initialization of the weights, you
still go ahead and train all the weights in all the layers using the data available to you.

(Refer Slide Time: 9:31)

Just to illustrate some of the points that we have talked about earlier: in this case I am
looking at AlexNet, okay, which we looked at before. Ideally, if your data is similar to
ImageNet and you have a classification or recognition task, right, then what you would do is
just drop this output layer, okay.

And then replace it: say you want to look at 10 breeds of dogs, then instead of 1000 outputs you
will have 10, and then train or fine tune with whatever data you have for the task, okay. But
let us say your input is radiology data, chest X-rays, so dissimilar but large data. Of course,
you would still go ahead and modify the output layer.

Let us say you need only normal versus abnormal; then instead of 10 you will have just one
output, right, which can be 0 or 1, and you would train the entire network. This assumes that
you have entirely different data but enough of it, because, as you know, AlexNet has several
million parameters, about 60 million, and if you want to train them you need a correspondingly
large dataset; of course, you can also do data augmentation in order to increase the size of the
dataset artificially.

So, given that the data is dissimilar (chest X-rays are a good example, or medical imaging data
in general, or some form of spectral data which is not similar to ImageNet), you would have to
train the network from scratch, but you can use the pre-trained weights as a good
initialization. Okay, that is typically recommended.

In any case, if you have enough data, let us say tens of millions of data points, images for
instance, for some particular task, one thing you can still do is retain the network
architecture. In this case I have shown AlexNet, but if you would rather use Inception or some
other model, that is fine.

But you train from scratch. That way you do have a CNN architecture that has been proven to work
with large image datasets. You can use that same architecture, even try to see whether some of
the training hyperparameters are applicable or not, and use your data to train it from scratch.

It is only in the case where you have dissimilar data but not enough of it, okay, in the sense
that it is not millions but maybe several thousands, and with data augmentation maybe tens of
thousands, that the idea would be to freeze some of these layers, okay, and train only the final
few layers using your data and hope for the best, okay. The assumption behind freezing these
layers is that you believe at least the local, low-level features are transferable to the task
that you are interested in.

All these strategies put together are typically referred to as transfer learning, wherein, in
simple terms, you take a pre-trained network and see if you can just modify it slightly to
accomplish your task of interest. Okay, so the other scenario, which I think I alluded to early
on, is when your data is actually very similar; let us say in this case you are looking at
AlexNet trained on ImageNet data.

Maybe you have very similar data; let us say you are looking at cat breeds or dog breeds and you
want to identify different dog breeds based on just the picture. Then, as we discussed earlier,
you can just take this embedding out and put it through, let us say, another neural network with
either 10 or 100 outputs, for 10 or 100 dog breeds, okay. Such a thing is possible, okay. So if
the data is very similar, you may not even have to train from scratch or do fine tuning; you can
just take the embedding from the network, okay.

This is done in many cases. For instance, if you are trying to do tasks like video frame
segmentation or object detection, and many of the objects in your inputs look like objects in,
let us say, ImageNet, then you can use networks trained on ImageNet directly in such
applications, okay. When you move across applications, you have to be careful about how you are
going to accomplish transfer learning.

Machine Learning for Engineering and Science Applications
Professor Dr Ganapathy Krishnamurthi
Department of Engineering Design
Indian Institute of Technology, Madras
Segmentation of Brain Tumours from MRI Using Deep Learning

(Refer Slide Time: 0:28)

Hello and welcome back. In this video we will look at segmentation of brain tumours from
Magnetic Resonance Images, using some of the deep learning techniques, especially CNNs, that
we have seen so far. The brain tumours in question are gliomas; we will typically just refer
to them as brain tumours. They affect the central nervous system, usually the brain. This is
a serious illness, a form of cancer with a very poor prognosis, in the sense that survival
is typically less than 2 years.

Patients are typically monitored by magnetic resonance imaging, also referred to as MRI or
MR imaging. Like most modern imaging techniques it is non-invasive, and non-ionising
radiation is used: basically a magnetic field combined with RF excitation is what enables
the imaging. It has very good, sub-millimetre spatial resolution; temporal resolution is not
very important in this case, but the spatial resolution is good.

The idea behind imaging these patients is that you can visualize the tumour non-invasively,
and by looking at the tumour and measuring its size, the doctor can use that as a marker for
figuring out whether the tumour is progressing or responding to medication. So segmentation,
delineating the pixels corresponding to the tumour, is an important diagnostic task. It is
typically done manually by an expert radiologist; however, it can be very time-consuming and
there is some variability among radiologists on certain tasks.

(Refer Slide Time: 2:44)

And if you have a very large patient cohort and you want to do, say, a meta-analysis, then
manual delineation is simply not feasible. So, to augment a radiologist's effort, a deep
learning program that can effectively segment gliomas can be very valuable. Now, typically
not one image but volumes are acquired: so-called MR image volumes. We will not go into the
details of the physics behind the acquisition. The image volumes are basically 3-D arrays.

So each image is a 3-D array, of size, say, 256 x 256 x 100, where 256 x 256 is the in-plane
size and 100 is the depth across the anatomy; it is acquired in the form of slices. As far
as MR imaging is concerned, multiple image volumes are acquired, and each image volume
corresponds to what is known as a sequence. Each sequence corresponds to a certain
technique, a certain way of exciting the magnetic spins inside the human body.

Each of them gives rise to a different kind of grayscale contrast in the image. If you look
here, we have shown 3 types of MR images corresponding to the same anatomy. Do not worry
about the meaning of each one right now, but you can see that each of these images, even
though they show the same anatomy, has a different grayscale contrast. So multiple different
types of contrast are possible using MR imaging.

For a typical glioma imaging session, you will typically acquire about four such sequences,
and each sequence will have a size of 256 × 256 × 100, where 100 is the number of slices
through the anatomy. If you are wondering where that 100 comes from: let us say this is a
typical human head right here, and the brain is somewhere inside. You would acquire slices
which cut through the brain, or the head, so that you cover the entire brain or the entire
head in this case.

(Refer Slide Time: 5:26)

Just to reiterate: every MR image is actually a volume, a 3-D array, and you will typically
acquire about 4 such 3-D arrays per patient for diagnosing gliomas. A glioma, in line with
some conventions that have been laid out, is divided into these 4 sub-compartments: edema,
necrosis, enhancing tumour and non-enhancing tumour. If you see the image below, the tumour
is delineated this way.

The green regions are what is known as edema, which is some fluid or water accumulation.
This also tells you why we need 4 different sequences: certain components of the tumour are
seen much more clearly in certain sequences. Edema is seen very well in FLAIR and T2; the
necrotic region, where the dead cells accumulate, is seen very well in T1 post-contrast,
always referred to as T1C. Enhancing tumour, which indicates breakdown of the blood-brain
barrier, is again seen in T1C; if we look here, the enhancing tumour is the region marked in
colour.

The necrotic regions are once again marked by different colours here, in another sequence.
Non-enhancing regions are, roughly, the tumour regions that are none of the other 3; and
again, there is some variability among radiologists about what exactly these regions are.
So the final segmentation we are looking for, akin to semantic segmentation, is given by
this image: there are 4 classes of pixels that we want to delineate. What is left out here
is the normal class: everywhere else, all the pixels are normal pixels, which correspond to
a completely different class.

(Refer Slide Time: 7:19)

The classes among the pixels are edema, enhancing tumour, necrosis and non-enhancing tumour.
So that is the task. What does the data look like? This data is part of the Brain Tumour
Segmentation challenge, which is conducted every year as part of the medical imaging
conference called MICCAI, held in different cities around the world. It is one of the
premier conferences for medical image analysis, and this particular challenge has been one
of the more popular ones; a lot of people enter it, trying to win the challenge.

The challenge's acronym is BRATS. The dataset is publicly available, and it is a
multicentric dataset. To elaborate briefly, being multicentric matters because in MR imaging
the grayscale values, the contrast you see, and some of the artifacts and shading you get in
the images vary from scanner to scanner and from hospital to hospital. So it is important to
get data from different scanners, from different centres such as different hospitals, so
that your network generalises well to new data from different hospitals. And there are 2
types of gliomas: low and high grade.

The high-grade glioma is the more serious condition; typically, low-grade gliomas progress
to high-grade gliomas. I am not talking about diagnosis, just pointing out that there are 2
types of them, and the tumours in them look different, in the sense that different
delineations are required. If you train a network on high-grade gliomas, it is quite
possibly not going to work very well on low-grade gliomas; that is what I mean by different.

You would have to talk to a radiologist if you really want to understand the pathological
difference between low-grade and high-grade gliomas. So, each patient's data consists of 4
different volumes, i.e., 4 different sequences: the FLAIR sequence (fluid attenuated
inversion recovery), T1, T1C and T2; these are the names of the techniques used to acquire
the volumes. And if you see the panel below, again each technique gives rise to a slightly
different, in this case radically different, grayscale contrast that helps radiologists
identify the pathology in the image, as well as the different types of pathologies.

We will not go into all the steps mentioned here. Each MR sequence is skull-stripped: the
skull also appears in the image and generally interferes with a lot of the processing, so
you typically try to remove it. The sequences are registered, in the sense that patients do
not have the same head orientation during the acquisition of the different sequences; there
will be slight differences in orientation, so you correct for the pose, if you can call it
that.

And they are all resampled to have isotropic resolution, in this case 1 millimetre cubed.
The dimensions of each dataset: there are 4 volumes, each of size 240 x 240 x 155. Since the
data is 3-D, you can slice it along any axis; we typically look at the 240 x 240
cross-sections. The 240 x 240 cross-section is what is referred to as axial, and the 2 other
perpendicular cross-sections are referred to as coronal and sagittal.

We will not discuss those further, but you can look them up to get a better understanding.
So, typically you have this sequence of slices per volume, and you have 4 such volumes.
Ground truths are given for these datasets (210 + 75 cases); the ground truth corresponds to
the colour map, the segmentations we saw earlier, also shown right here. Each intra-tumour
class is marked by a different colour, and the task is to obtain a similar classification
using some machine learning technique or other image processing techniques.

(Refer Slide Time: 12:10)

We will explore the CNNs that have been very successful at this task. The first CNN we will
look at corresponds to this publication, one of the winning entries to, I think, the 2015
competition. Here they have used a CNN trained on 2-D patches: if you recall from our
discussion on semantic segmentation, one of the naive ways of doing semantic segmentation is
to extract patches from your images and label the centre pixel of each patch according to
its class.

That is the typical strategy followed here too. You have a training set of 4 sequences, and
you extract a patch centred over a pixel; you extract the patch from all 4 sequences. So you
input 4 channels, with a patch corresponding to each of the 4 sequences. Here we only look
at 2-D patches, not 3-D patches; we will look at 3-D patches later. Of course, we have the
ground truth corresponding to each patch, and prior to extracting the patches there is a lot
of pre-processing done, because these MR images are from multiple centres.

There is a histogram matching step, just to make sure that the intensities are consistent
across images: if some anatomy in the brain has an intensity of 100, you want it to be 100
in all the images. So you do histogram matching to match the distributions of pixel values
across the image volumes, across the dataset. So there are patch extraction and
pre-processing steps, and then you train a CNN with a corresponding loss function.
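
As a rough sketch of that pre-processing step, histogram matching of one volume to a chosen
reference can be done with scikit-image (the random arrays below are placeholders for real
MR volumes):

    import numpy as np
    from skimage.exposure import match_histograms

    # moving_vol: a volume from one scanner; reference_vol: the volume
    # whose intensity distribution we want to match (both 3-D arrays).
    moving_vol = np.random.rand(240, 240, 155)     # placeholder data
    reference_vol = np.random.rand(240, 240, 155)  # placeholder data

    # Remap intensities of moving_vol so its histogram matches the reference.
    matched = match_histograms(moving_vol, reference_vol)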

Since this is a classification task, an appropriate classification loss is used. Then,
during the testing phase, you use the trained CNN to get your final output label. Here each
output is one of 5 labels: you have the background label, and the 4 classes inside the
tumour.

(Refer Slide Time: 14:38)

The network used is inspired by the VGG network: a succession of convolution layers followed
by max pooling. These 2 layers have 64 filters defined, these layers have 128 filters, you
have a max pooling layer with a stride of 2, then fully connected layers, and here a 5-class
output. This particular group trained 2 networks, one for high-grade glioma and the other
for low-grade glioma. Actually, it is the other way around in the figure: this is the
low-grade glioma network and this is the high-grade glioma network.

You can see that the high-grade glioma network has a slightly larger number of convolution
layers: in this case 3 convolutions with 128 feature maps, with max pooling layers
interspersed. Where the low-grade network has 2 successive convolutions with 128 feature
maps, the high-grade one has 3. So they trained the two networks with data corresponding to
low grade and high grade respectively, and they had the most competitive performance in the
segmentation challenge, which was in 2015, as I recall.

This is inspired by VGG, but the number of filters and the number of layers is quite
different. You can also understand that you cannot have a network as deep as VGG, because
you do not have that many data points. The training was done using patches, as we saw
earlier. Since the size of your input volume is 240 x 240 x 155, this group extracted
patches of size 33 x 33. There are 4 sequences, so your input has 4 channels, with a patch
of size 33 x 33 extracted from each of the sequences.

You stack them together; that is your input. The output is to classify the centre pixel of
that patch as belonging to one of the 5 categories we saw earlier: normal tissue, enhancing
tumour, non-enhancing tumour, necrotic core and edema. These 5 classes are the outputs. This
VGG-inspired network was one of the earliest CNNs used for this particular task, and it won
the challenge.

Some takeaways here: they used an architecture that worked on ImageNet (it is inspired by
the VGG architecture); the training is patch-based, so as to label the centre pixel; the
entire network was trained from scratch, no weights were reused from the VGG network; and
they also had to do data augmentation by flipping, rotations and translations.
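
A minimal sketch of this kind of patch extraction and flip-based augmentation (array shapes
follow the 4-sequence, 33 x 33 setup described above; all names are placeholders):

    import numpy as np

    def extract_patch(volumes, labels, i, j, k, size=33):
        # volumes: (4, 240, 240, 155), one channel per MR sequence.
        # labels:  (240, 240, 155) with values 0..4.
        # Returns a (4, 33, 33) patch from slice k centred on (i, j),
        # together with the class label of its centre pixel.
        h = size // 2
        patch = volumes[:, i - h:i + h + 1, j - h:j + h + 1, k]
        return patch, labels[i, j, k]

    def augment(patch):
        # Simple augmentation: random horizontal / vertical flips.
        if np.random.rand() < 0.5:
            patch = patch[:, ::-1, :]
        if np.random.rand() < 0.5:
            patch = patch[:, :, ::-1]
        return patch.copy()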

(Refer Slide Time: 18:13)

Just to summarise once again: the idea is to use patches of size 33 × 33, and there are 2
different networks, one for low-grade and one for high-grade glioma. The network is trained
to classify the centre pixel of the patch. Following that, there is something called
connected component analysis, wherein you retain the largest connected component of pixels.
Connectivity is defined as 4- or 8-connectivity: if you have a grid, let us say a 3 × 3 grid
covered by pixels whose labels all belong to one class,

then they are all connected, so you retain them. Let us say somewhere else in the image you
have one isolated pixel; you typically tend to ignore it. So you group the pixels depending
on how they are connected, find the largest connected component based on this kind of
connectivity, and remove the other groups: even though they are internally connected, they
are much smaller, so they are removed. By retaining the largest connected component, they
were able to get very good scores.
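
A minimal sketch of that post-processing step, using scipy's connected-component labelling
(the mask argument is a placeholder for the binarised network prediction):

    import numpy as np
    from scipy import ndimage

    def keep_largest_component(mask):
        # Label connected groups of foreground pixels. The default
        # structure gives 4-connectivity in 2-D; pass a custom structure
        # (e.g. np.ones((3, 3))) for 8-connectivity.
        labelled, num = ndimage.label(mask)
        if num == 0:
            return mask
        sizes = np.bincount(labelled.ravel())
        sizes[0] = 0                  # index 0 is the background
        return labelled == sizes.argmax()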

(Refer Slide Time: 19:50)

These are the typical segmentations you get from the network. Another approach to this
problem uses a UNet, the UNet architecture we saw earlier. Here we can actually predict an
entire slice, an entire patch, in 1 pass, rather than, as in the previous technique,
classifying only the central pixel as belonging to one of the 5 classes. So we can use fully
convolutional neural networks; that was one of the entries in the competition. An
encoder-decoder type network, which is what a UNet is, can be used to accomplish this task.

(Refer Slide Time: 20:21)

So let us see how that works. Here you can give the entire slice as input: once again the
size is 240 × 240 and there are about 155 such slices; this is one slice. In the previous
network, each slice was rasterized: you take a patch of size 33 × 33, centre that patch on
every pixel, and try to predict the centre pixel of every patch. That takes a lot of time;
instead, using a fully convolutional neural network, you can predict the labels of all the
pixels in one forward pass through the network.

This is a typical UNet architecture that we have seen earlier. It takes a 240 × 240 input
and predicts the classes associated with all of the pixels. It has an encoder-decoder type
of architecture, which we have seen before: these are the downsampling layers and these are
the upsampling layers, and there are what we can call shortcut connections or skip
connections from the downsampling layers to the upsampling layers, in order to improve your
resolution. The skip connections are used to combine the low- and high-level features in the
network.

This was one of the entries in the competition; the citation is at the bottom of the page,
and you can look it up for more information. Of course, this might not be the most efficient
way of doing it; a better method would have been to predict, say, a smaller area within the
slice. There are 2 problems here, if you notice. First, as we mentioned earlier, the images
are 3-D: these are not two-dimensional images, they are three-dimensional images.

And it is more meaningful to process them as such: successive slices through the volume are
correlated, and it is good to exploit that correlation. Second, within a slice (one
cross-section, shown right here), there is not much information near the edges of the image:
you lose out because the neighbourhood information is missing close to the edge. So it would
be more meaningful to predict only a smaller area inside the image, even when using a UNet.

So you can take a crop, combine it, and use that in the skip connections, rather than using
the entire feature map; that would have been more accurate. And the other thing to point
out, again, is that we are not considering the image volume, only looking at the
cross-sections individually.

(Refer Slide Time: 23:42)

This is one of the results from that UNet segmentation. We see quite a few false positives;
as I mentioned earlier, if you do connected component analysis, wherein you retain the
largest cluster, you can get rid of the smaller ones, so that is a useful step. On top of
that you can also apply a conditional random field, but that is not done here. If you look
at the prediction after post-processing, which is typically connected components, it
compares very well, at least qualitatively in this case, with the ground truth annotation.

In this video we will look at another architecture for brain tumour segmentation, called the
2-D Tiramisu with 103 layers. It is basically inspired by the DenseNet architecture: it has
dense blocks, transition-down blocks and transition-up blocks. Each dense block has 3
convolutional layers, and each layer is a composite operation consisting of batch norm,
ReLU, a 3 x 3 convolution and a dropout layer. The transition-down layer subsamples the
feature maps: again batch norm, a ReLU nonlinearity, a 1 x 1 convolution, again a dropout
layer added during training, followed by max pooling.

(Refer Slide Time: 25:26)

The transition-up layer has 3 x 3 transpose convolutions with a stride of 2, and the network
predicts the entire input in 1 pass. A dense block, as you have seen before, is a series of
convolution layers where each layer receives input from all the previous layers. In order to
prevent a feature map explosion, the growth rate is restricted to 4. The transition-down
layer, as we saw before, is used to reduce the spatial dimension of the features on the
downsampling side of the network.

And the transition-up is the transpose convolution layer used on the upsampling side of the
network. The typical architecture is given here: you have 4 channels as input, corresponding
to the 4 different MR sequences, each of size 240 × 240. There is an initial convolution
which leads to 48 feature maps, followed by a dense block and a transition-down block. After
the transition-down block, the feature map size is reduced to 120 × 120.

You go on through 4 dense blocks (1, 2, 3, 4), which give rise to 15 x 15 feature maps, 464
of them; then there is a bottleneck layer which does 1 x 1 convolutions, and a
transition-up. Then you go through the upsampling path, which again has 4 dense blocks
interspersed with transition-up blocks, to get an output with 3 channels: in this case we
are predicting 3 classes, and you can take an argmax across the classes to get a 240 × 240
output. The reason only 3 classes are predicted, instead of the 5 we saw earlier, is that
some of the classes were merged in this version of the BRATS segmentation challenge.
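
As a rough sketch of one such dense block (a PyTorch-style composite of batch norm, ReLU,
3 x 3 convolution and dropout with growth rate 4, as described above; this is an
illustration, not the authors' exact code):

    import torch
    import torch.nn as nn

    class DenseBlock(nn.Module):
        # Three composite layers; each layer sees all previous feature maps.
        def __init__(self, in_channels, growth_rate=4, n_layers=3):
            super().__init__()
            self.layers = nn.ModuleList()
            for i in range(n_layers):
                ch = in_channels + i * growth_rate
                self.layers.append(nn.Sequential(
                    nn.BatchNorm2d(ch),
                    nn.ReLU(inplace=True),
                    nn.Conv2d(ch, growth_rate, kernel_size=3, padding=1),
                    nn.Dropout2d(p=0.2),
                ))

        def forward(self, x):
            for layer in self.layers:
                # Dense connectivity: concatenate along the channel axis.
                x = torch.cat([x, layer(x)], dim=1)
            return x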

(Refer Slide Time: 27:25)

The other architecture we are going to look at is called DeepMedic, from a group in the UK.
They exploited the 3-D nature of the input data: the slices are all correlated across the
volume. Of course, since your volumes are of size 240 x 240 x 155, it is not possible to
give 4 such full volumes as input to the network; you would run out of memory, especially
since, remember, you have to use mini-batches during training.

So the idea is to restrict the size of the patches so as not to have memory issues, but at
the same time exploit the 3-D nature of the images. We will briefly look at the architecture
used; this was the winning architecture 2 years in a row, I think 2016 and 2017. This group
(citation at the bottom) used a multi-pathway, or 2-pathway, network: one pathway is
supposed to give local features at high resolution, and the other gives global features at
low resolution. The local features are learned from 3-D cubes of size 25 (25 cubed), with 4
such channels.

The global features are learned from patches of size 51, resized to 19 cubed. So the large
green box is the 51-sized patch; it is resampled to a 19 x 19 x 19 volume from inside the
image, and since there are 4 channels, we have 4 such cubes. Then you have a sequence of
convolutions, and in between there are residual layers, like in the ResNet architecture.
Finally, the lower pathway, which is the global feature pathway, is upsampled and
concatenated.

Then you have the usual fully convolutional layers, which lead to an output of size 9 cubed.
So you are taking a very large context, typically 25 cubed, and predicting 9 cubed out of
it. You have 5 classes, so the output is 9 cubed per class, each voxel giving the
probability of that particular class. This architecture was very efficient, and it won the
challenge 2 years in a row. It incorporates several ideas. One is that we need 3-D context;
for medical images that is very important, because the images are inherently 3-D, so it is
good to exploit that.

It uses 2 pathways: one pathway looks at a higher resolution but a small patch size; the
other pathway looks at a lower resolution, at global features, but with a bigger patch size.
You resize the 51 patch to 19 cubed and do a sequence of convolutions and max poolings, as
well as residual skip connections to improve training. So this particular architecture
incorporates both the concepts I mentioned earlier. And also, instead of trying to predict
the entire image in one pass, it predicts only smaller volumes, so you do have to raster
through the volume in order to obtain the full segmentation.

So it is not one pass to get the entire volume, but multiple passes, because you are
eventually predicting only a 9-cubed region: a small subset of voxels inside the patch you
have taken. But you are considering information from a very large neighbourhood of that
patch, and that is handled easily by the network.

(Refer Slide Time: 31:46)

They also have a conditional random field regulariser after the predictions. If you look at
what these images show: this is the manual delineation, the ground truth, and this is the
output predicted by the network. The CRF is used to regularise your output predictions; that
is one of the other novelties in this particular submission. A variation of this, in the
sense of using 3-D patches, is the 3-D Tiramisu. Its building blocks are very similar to the
2-D variant; however, the patches are 3-D in nature.

Instead of using an entire 2-D slice as input, you take 64-cubed patches as input and try to
predict all of their voxels in one pass through the network. One of the challenges in this
brain tumour segmentation is that there is a huge class imbalance, which is the final
talking point in this application we are looking at. If you look at these images, you will
find that some classes are wholly underrepresented: for instance, the non-enhancing tumour
or the necrotic core comprise a very small number of the pixels corresponding to the tumour
itself.

And if we consider the tumour itself, it occupies a very small volume in the entire brain:
less than 5 percent of the brain actually corresponds to the tumour. That is the class
imbalance. So if you are trying to do a pixel-wise classification, most pixels are normal;
only a very small percentage of the pixels actually comprise the tumour itself. And within
the tumour, there are these classes which are very underrepresented. These considerations
have to be taken care of when you train these networks.

(Refer Slide Time: 34:04)

For instance, you should make sure that the cubes you are sampling, or, if you are doing a
patch-based approach where you classify the centre pixel, the patches you are sampling,
adequately cover the underrepresented classes: you oversample, or do more data augmentation,
for the underrepresented classes and train the network accordingly. All the networks we have
seen so far, using either 2-D or 3-D patches, suffer from this in varying degrees. The
advantage of using 3-D patches is that the class imbalance is slightly alleviated: if you
are looking at a slightly larger volume through the tumour, then the volume will contain
enough samples of every class; that is a general observation we can make.
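
A rough sketch of such class-balanced sampling of patch centres (the 50/50 split and the
function name are illustrative assumptions, not the scheme of any particular entry):

    import numpy as np

    def sample_patch_centres(labels, n_patches, tumour_fraction=0.5):
        # labels: 3-D array of voxel labels (0 = normal, 1..4 = tumour).
        # Draw a fixed fraction of patch centres from tumour voxels so
        # the rare classes are not swamped by normal tissue.
        tumour_idx = np.argwhere(labels > 0)
        normal_idx = np.argwhere(labels == 0)
        n_t = int(n_patches * tumour_fraction)
        pick_t = tumour_idx[np.random.choice(len(tumour_idx), n_t)]
        pick_n = normal_idx[np.random.choice(len(normal_idx), n_patches - n_t)]
        return np.concatenate([pick_t, pick_n], axis=0)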

So using 3-D patches for medical images, at least in the problem we are looking at,
alleviates the class imbalance problem to some extent. Of course, you can always sample so
that the underrepresented classes have more samples, to match the classes that are
overrepresented. And we also typically see that many of the techniques use connected
components, keeping the largest connected component, or use a conditional random field
approach, to clean up the segmentation.

The cleaning up is needed because there will always be false positives: some small clusters
or groups of pixels labelled as tumour when they are actually normal. If you want to remove
those, you can do either connected components or a CRF. The CRF takes the context into
account, so that is a better way of doing it. But many research groups, for that matter, do
not try to do this post-processing; they just use the output from the network itself as the
final result.

(Refer Slide Time: 36:06)

You can see the impact of the post-processing on this particular segmentation. If you look
at the post-processed image, this particular small region, corresponding to one of the
classes, has been removed, and the result matches very well with the ground truth right
here. So post-processing helps improve your score a bit. And when you are sampling these
64-cubed patches, the stride at which you sample also seems to matter: sampling with
overlap, in this case a stride of 32, was found to be especially useful for classifying the
voxels near the boundaries of the patches.

Segmentations obtained with overlapping strides seem to be smoother than with the
non-overlapping approach; that is something we have observed in our experience.
Post-processing is of course necessary in many cases, depending on how accurate your network
is and the number of false positives it generates; that is an important aspect of the
processing pipeline. So that concludes our case study on using convolutional neural networks
for analysing medical images, specifically to segment brain tumours from MR volumes. It is a
particularly challenging problem, since brain tumours are diffuse, with no clear-cut
boundaries in many cases.

The problem also involves segmenting subclasses within the tumour, which again are not that
well-defined in many cases. The size of the data is also a challenge: the data is inherently
3-D, and every patient, every data point, is actually made up of 4 volumes. How to extract
patches from those volumes, whether 3-D or 2-D, how to train on them, and how to do the
inference efficiently, so that given a patient volume you can process it in a reasonable
time, are all important requirements: both accuracy and computational efficiency matter. So
that concludes our session on CNNs; we will move on to other deep learning architectures in
the subsequent lectures. Thank you.

Machine Learning for Engineering and Science Applications
Professor Ganapathy Krishnamurthi
Department of Engineering Design
Indian Institute of Technology, Madras
Activation Function

(Refer Slide Time: 0:14)

So, we will look at the sigmoid non-linear activation now. Here is the analytical expression
for the sigmoid non-linearity, σ(x) = 1/(1 + e^(−x)), and this is the plot of the function:
this axis is the x-axis and the y-axis is σ(x). You can see that for large positive values
of x the sigmoid function tends to 1, and for large negative values of x it tends to 0. The
thing to notice here is that it becomes flat for large positive values of x as well as for
large negative values of x.

In this slide we use x, but typically the input to the nonlinearity is your z, which is
z = Σᵢ wᵢ xᵢ (there is also a w₀, a bias term, not included in this summation). So z is what
is typically used, and you can also rewrite this as σ(z); this is just so that you do not
get confused when you see us using z in a later lecture.

The input to the sigmoid function, as you might know already, is the linear combination of
your input features with the weights. What this says is that if z = Σᵢ wᵢ xᵢ is very large
in magnitude, whether very large positive or very large negative, the sigmoid function will
be saturated.

Now, what does that mean? It means that the gradient, in this case ∂σ(z)/∂z, will be close
to 0, a very small number. If you attended the back propagation lecture, you will see that
this leads to negligible or zero updates to your weights, so learning basically stops: the
weights will not be updated as frequently, and they will not change as dynamically as you
expect them to.
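
A quick numerical check of this saturation (a minimal sketch; the identity
dσ/dz = σ(z)(1 − σ(z)) follows directly from the definition):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    for z in [0.0, 2.0, 10.0, -10.0]:
        grad = sigmoid(z) * (1.0 - sigmoid(z))   # d sigma / d z
        print(f"z = {z:6.1f}  sigma(z) = {sigmoid(z):.5f}  grad = {grad:.2e}")

    # At z = +/-10 the gradient is about 4.5e-05, so weight updates
    # that pass through this unit effectively vanish.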

So this is one of the problems with the sigmoid function. Of course, it also has advantages:
one of them is that if you want your output layer to be interpreted as a probability score,
then this would be the natural thing to use, because the outputs are between 0 and 1. It is
also known as a squashing function, because it squashes the output into the range 0 to 1.
The next function we are going to look at is the hyperbolic tangent (tanh), which has very
similar features except for the range of the output.

(Refer Slide Time: 3:04)

The tanh function is again one of the activation functions that is often used; it is one of
the earliest functions to be used. Here the range of the function is between −1 and 1, and
the same saturation problems exist. It acts like the identity near the origin, i.e., it is
approximately linear there, unlike the sigmoid function, so it is preferred for certain
types of problems. Saturation here again means that the gradient with respect to the input
argument goes to 0, which means that the weights will not get updated.

(Refer Slide Time: 3:43)

Next is the ReLU, or rectified linear unit as it is known. It is one of the more effective
activation functions; it has been used widely and very successfully in computer vision
problems, and it is one of the preferred activation functions for many modern network
architectures. It is very simple: it is the identity for all values of x greater than or
equal to 0, and for any input less than 0 the output is 0.

Here, one issue is that for very small positive values of the input, the function value,
and hence the signal passed on, will be very small; that is one drawback. And for negative
values the gradient goes to 0: since f(x) = 0 for x < 0, we have ∂f(x)/∂x = 0 there, so no
updates will be made to the weights. That is the problem with ReLU, but otherwise it is a
very effective activation function. Note that it is indeed a non-linear function, in the
sense that it passes the input through only for values greater than 0 and is 0 for all x
less than or equal to 0.

(Refer Slide Time: 5:15)

The Leaky ReLU was designed to take care of some of the problems that ReLU had: it gives you
a small gradient for negative values of the input. In the previous slide we saw that for the
plain ReLU, if the input is negative, the gradient is 0; in this case, for all negative
values of x it gives you a scaled value of that input, which means that the gradient exists.

Another version of this is the parametric rectified linear unit, wherein, instead of having
a fixed scaling factor for negative values of x, we have an α which is itself a parameter
learned during the back propagation process.

(Refer Slide Time: 6:06)

858
Yet another variation is the exponential linear unit, which for negative values of x takes
this form; this is to make sure that the average activation in a layer goes to 0. Again,
this has a nonzero gradient for negative values of x. So these are the typical activation
functions used in networks. I have not shown the actual analytical form for tanh(x) here;
you can look it up and convince yourself that the plot looks this way.

So you can see that the problem with sigmoid and tanh is basically the saturation. On the
other hand, they are very convenient in the sense that the output is squashed between −1 and
1, or between 0 and 1 for the sigmoid. So, for example, if you want your output layer to be
interpreted as a probability score, then at the output layer something like the sigmoid
function would be an appropriate non-linearity. Sigmoid, tanh and ReLU are the most often
used and most effective activation functions across all flavours of artificial neural
networks.
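
Putting the activations from this lecture side by side, here is a minimal NumPy sketch of
their definitions (the leaky slope 0.01 and the ELU parameter a = 1 are common default
choices, stated here as assumptions):

    import numpy as np

    def sigmoid(x):                  # squashes into (0, 1); saturates for large |x|
        return 1.0 / (1.0 + np.exp(-x))

    def tanh(x):                     # squashes into (-1, 1); ~linear near origin
        return np.tanh(x)

    def relu(x):                     # identity for x >= 0, zero otherwise
        return np.maximum(0.0, x)

    def leaky_relu(x, slope=0.01):   # small gradient for x < 0
        return np.where(x >= 0, x, slope * x)

    def elu(x, a=1.0):               # smooth negative part, mean activation near 0
        return np.where(x >= 0, x, a * (np.exp(x) - 1.0))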

Machine Learning for Engineering and Science Applications
Professor Ganapathy Krishnamurthi
Department of Engineering Design
Indian Institute of Technology, Madras
Learning Rate decay & Weight initialization

(Refer Slide Time: 0:14)

The other important topic we would like to look at, before we go into how to initialise
weights, is learning rate decay. The idea here is that because a neural network is a very
complicated function of the weights, it is very difficult to say when we are near a good
minimum and when we are near some very poor minimum, so there must be some systematic way of
changing the learning rate. This is basically the α that you are already used to: you must
have seen x = x − α ∂L/∂x.

For the weights, w = w − α ∂L/∂w. What people typically do is vary this α with every
iteration or epoch. What do we mean by an epoch? An epoch is when you have gone through your
entire dataset once. Say you are using stochastic gradient descent or mini-batch gradient
descent: once you have run through the entire dataset, that is considered one epoch. So from
epoch to epoch you can vary your learning rate.

This is important because the learning rate dictates by how much your parameters change: the
magnitude of the update depends not only on the gradient of the loss function with respect
to the weights, but also on the learning rate α, so by modulating α you can also modulate
the magnitude of your updates. What happens when you have a very high learning rate, say α
is a large number close to 1, is that your parameter values vary rapidly and by large
amounts, and will not settle down into a local minimum.

However, a lower learning rate leads to slow learning, which means that the parameters will
not change rapidly, and it is quite possible that they get stuck in some false minimum.
There is no good way to know when to do what, but the technique people typically use is to
decrease the learning rate as the number of iterations or epochs increases.

(Refer Slide Time: 2:55)

There are many ways to do that. One of them is to reduce the learning rate by a constant
factor every epoch, or every k epochs; this is again decided by you, so consider it another
hyperparameter that you have to optimise. Another way uses the validation data: you have
validation data and training data, so you check the performance on the validation data, and
whenever the performance on the validation data stops improving, you decrease the learning
rate by a certain fixed factor. What that factor is, of course, has to be determined by
trial and error or some systematic search, as we saw earlier.

(Refer Slide Time: 3:45)

Okay, one of the more automated ways to ensure a very smooth decay: in the previous
techniques, you set things by hand; you decrease the learning rate by a fixed factor every
epoch. Here is a smoother way of decreasing the learning rate:

    α = α0 / (1 + decay × epoch)

So, α0 is your initial learning rate, which you set; again, this is another hyperparameter
that you have to figure out. It is divided by 1 plus some decay rate times the epoch number.

So as the number of epochs increases, depending on the magnitude of the decay, your learning
rate will decrease. This is a very smooth way of decreasing your learning rate. There are
also exponential schemes for decaying the learning rate; that is also possible. Many of
these software packages have these schedules implemented as black boxes, and you are welcome
to use them; of course, you should know what you are doing, which is the reason we are
explaining this here.
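
A minimal sketch of the two schedules just described, step decay every k epochs and the
smooth α0 / (1 + decay × epoch) form (all constants are example values):

    def step_decay(alpha0, epoch, factor=0.5, k=10):
        # Reduce the learning rate by a constant factor every k epochs.
        return alpha0 * (factor ** (epoch // k))

    def smooth_decay(alpha0, epoch, decay=0.01):
        # alpha = alpha0 / (1 + decay * epoch)
        return alpha0 / (1.0 + decay * epoch)

    for epoch in [0, 10, 50, 100]:
        print(epoch, step_decay(0.1, epoch), smooth_decay(0.1, epoch))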

(Refer Slide Time: 4:52)

The next topic is weight initialization. This is again a very important topic, because if
you do not initialize the weights properly, you will not get good convergence (you will see
why that happens), and then you have to keep trying out different starting points, training
the network again and again, before you finally hit on some initialization that actually
works.

Why is it important? Because in a typical neural network there are a large number of
weights: hundreds or thousands of weights for a very small ANN, going up to millions and
millions of weights for large neural networks. This leads to a very large search space: you
have to minimise your loss function, the loss function is a function of your weights, so it
is a multidimensional problem that you are solving, and if you want to optimize that loss
function, it is a very large search space of weights.

So you are looking for a combination of a hundred million weights or more which will
actually work, and the weights must be initialized effectively randomly, both to prevent
convergence to false minima and to explore the entire space. Just as we saw how to pick the
range of hyperparameters for hyperparameter optimization, we also have to figure out the
range of weights for initialization, so that the entire loss surface is covered, to some
extent at least; that is required for the effective performance of the neural network.

(Refer Slide Time: 6:40)

So how is this typically done, and what are the problems? A naive initialization would
involve sampling weights from a Gaussian distribution, or a uniform random distribution,
with zero mean and unit variance. So your distribution of weights, for one weight or a bunch
of weights, will look like this: this is w and this is the probability of w. Typically, this
is the distribution from which you pick.

Now, we can do this, but there are some problems associated with it, and we will see what
those problems are. Assume that your input data x has also been normalized; remember, we
normalized with z-score normalization to have zero mean and unit standard deviation. We are
also picking weights so that they have zero mean and unit variance. So your weights wᵢ have
zero mean and unit variance, and your input features xᵢ also have zero mean and unit
variance.

We also assume that the individual xᵢ are independent, and that the wᵢ are independent. Note
that the xᵢ being independent need not be true, and subsequently the wᵢ also need not be
independent, especially in the case of images where there is structure, but in many cases
this is a reasonable assumption. So then, let us say we have the first input layer with
features xᵢ, and we have, say, M features.

Your linear combination then gives rise to Σᵢ wᵢ xᵢ. Let us write this out; I will stick to
our notation z and ignore the bias w₀:

    z = w1 x1 + w2 x2 + · · · + wM xM,

where M is the number of features (we are talking about the first layer). It turns out that
under these assumptions (the wᵢ are independent, the xᵢ are independent, and the wᵢ and xᵢ
are independent of each other), the variance of z is

    Var(z) = M · Var(w) · Var(x),

which in this case is approximately just M, since both variances are 1.

So what is the implication? The implication is that if you randomly sample the w's from a
Gaussian with zero mean and unit standard deviation, and of course you have normalized your
features to be that way too, then the variance of the sum z, which, remember, is the input
to your sigmoid function (let us consider the sigmoid just for the sake of argument), is M
times larger. So z can take very large values, and in that case the sigmoid would saturate.

During back propagation the derivative would then be zero, which means the weights will not
get updated, so you are stuck. So this is the problem with drawing from a unit-variance
distribution. Of course, people have done this, and then you have to do many trials and
errors until at some point you get one good combination which gives you good back
propagation. Note that this is just looking at the inputs to one neuron: what we have
discussed is one neuron somewhere in the first layer, with the xᵢ and the weights wᵢ leading
into it; also recall that there will be weights going out of it as well.

In this case I have used M features because we are assuming we are at the input layer, but
typically, instead of M, we would say Nin, which is the number of weights, the number of
neurons, feeding into a neuron in the next layer. The other piece of terminology is Nout,
which is the number of weights emanating from a neuron in a layer; you have to keep track of
that.

So we have established that the variance of z, where z is a linear combination of the inputs
to a neuron multiplied by the weights, is pretty much M. So it makes sense to scale our
choice of weights by M: when we sample, we will sample from a distribution with zero mean
and variance 1/M. That makes sense, so that is what we will do.
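
A quick simulation of this scaling argument (a sketch with unit-variance weights and inputs
and M = 1000):

    import numpy as np

    M, trials = 1000, 10000
    w = np.random.randn(trials, M)      # zero mean, unit variance weights
    x = np.random.randn(trials, M)      # zero mean, unit variance features
    z = (w * x).sum(axis=1)             # z = sum_i w_i x_i, one per trial

    print(np.var(z))                    # ~ M = 1000: deep in saturation

    w_scaled = w / np.sqrt(M)           # now Var(w) = 1/M
    z_scaled = (w_scaled * x).sum(axis=1)
    print(np.var(z_scaled))             # ~ 1: z stays in a sensible range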

(Refer Slide Time: 12:16)

So, from the previous argument, the variance of w should be 1/Nin, where Nin is the number
of neurons feeding into a particular neuron from the previous layer, i.e., the number of
weights coming in from the neurons in the previous layer. And of course, from each neuron
the output goes to multiple neurons; that number we call Nout.

Now, in the previous argument we have only looked at the forward pass. It turns out that if
you consider the backward pass, that is, when you are doing back propagation and you want to
preserve the gradients, then you have to take this other number into account as well,
because the gradients feed into each of these neurons from Nout neurons.

So what Xavier initialization does (named after the author of the paper that proposed it) is
to make sure the variance of w is:

    Var(w) = 2 / (Nin + Nout)

This is the solution, and it turns out that it works very well for a wide variety of
problems. In both of these arguments, the idea is that when you consider the weights as
independent and the features as independent, the variance of their linear combination scales
with the number of weights (or the number of neurons); to take that into account, you scale
the variance of the distribution from which you sample the weights by an appropriate factor.
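
A minimal sketch of both initialisations (the fan-in and Xavier variances follow the two
arguments above; the layer sizes are example values):

    import numpy as np

    def init_fan_in(n_in, n_out):
        # Gaussian weights with Var(w) = 1 / Nin (forward-pass argument).
        return np.random.randn(n_out, n_in) * np.sqrt(1.0 / n_in)

    def init_xavier(n_in, n_out):
        # Gaussian weights with Var(w) = 2 / (Nin + Nout).
        return np.random.randn(n_out, n_in) * np.sqrt(2.0 / (n_in + n_out))

    W = init_xavier(784, 256)
    print(W.var())   # close to 2 / (784 + 256), i.e. about 1.9e-3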

These are commonly used in practice: Xavier initialisation is very commonly used, and so is
the 1/Nin version. Both of them give very good results in terms of how quickly a network
converges; what they help with is fast training. Otherwise, training can take forever to
converge, because with saturation the gradients become zero, back-prop does not update the
weights effectively, and learning becomes very slow; with a good initialization you get
effective learning.

So that is our lecture on hyperparameter optimization, data normalization (data scaling, as
it is also called), and weight initialization. If you have questions, please post them on
the forum.

Machine Learning for Engineering and Science Applications
Professor Ganapathy Krishnamurthi
Department of Engineering Design
Indian Institute of Technology, Madras
Data Normalization
(Refer Slide Time: 0:13)

Hello and welcome back. In this video we will look at data pre-processing, or data
normalization, techniques to apply before you feed data as input to some of the common
machine learning algorithms.

(Refer Slide Time: 0:29)

What I will describe now is the most commonly used technique, called z-score normalization.
I will just go through what is actually done, and then we can see why we have to do it. So
let us say we have M training data points, and there are N dimensions, which is to say N
features; incidentally, these techniques are also known as feature scaling techniques.

What we typically do is make them all zero mean. Let us denote our data points by x, and say
we are considering a feature x1. We calculate the mean of that feature, µ1, as follows:

    µ1 = (1/M) Σ_{i=1..M} x1^(i)

and then just subtract the mean from the feature, i.e., x̂1 = x1 − µ1.

We do that for every feature in the dataset, so now all our data points are zero-centred,
i.e., zero mean. The next step is to divide the mean-subtracted feature, which I denote x̂1
just to be different, by its standard deviation. The variance is given by:

    σ1² = (1/M) Σ_{i=1..M} (x̂1^(i))²

The square root of the variance, i.e., the standard deviation σ1, is what we want. (Note the
normalization by the total number of data points M in both formulas: since we are computing
a mean, we have to divide by M.) Since we have already subtracted the mean out, this gives
the variance; we take its square root to get σ1, and finally replace x̂1 → x̂1/σ1.

So that is the process: calculate the mean for each feature independently over the training
data, not over the entire dataset (we have split the data into training, validation and
testing, so you calculate the mean over the training data only), subtract the mean of each
feature from the individual data points, and divide by the standard deviation of each
feature.
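
A minimal NumPy sketch of exactly this procedure (the statistics are computed on the
training split only and then reused on the test data; the arrays are placeholders):

    import numpy as np

    X_train = np.random.rand(500, 10) * 100   # placeholder: M = 500, N = 10
    X_test = np.random.rand(100, 10) * 100

    mu = X_train.mean(axis=0)     # per-feature mean over training data
    sigma = X_train.std(axis=0)   # per-feature standard deviation

    X_train_norm = (X_train - mu) / sigma
    X_test_norm = (X_test - mu) / sigma   # reuse the training statistics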

This is the most often used data normalization technique. It implicitly assumes that your
data is Gaussian distributed, which might not be true much of the time. What does it
accomplish? It makes sure your data points have zero mean and unit standard deviation, i.e.,
unit variance.

(Refer Slide Time: 4:00)

So why is it essential? Why do we want to do that? Before we get to that, let us look at
what the data looks like after the normalization. I am considering two features, x1 and x2:
x1 ranges from 1 to 1000 and x2 ranges from 10³ to 10⁶. If you want a concrete example, you
can think of x1 as the square footage, the area of some living space, apartment or plot of
land, and x2 as the price of that living space (very cheap by today's standards), in rupees
or whatever currency you want to choose.

If you plot these two, with x1 on this axis and x2 on that axis for all 3 plots, that is
what it looks like. Since they are both drawn from random normal distributions, it does not
look like anything spectacular, but I just want to show you what happens after the
normalization. If I subtract the mean from x1 and x2, you can see that they are
zero-centred; but you can also see that the span of x1, which you can think of as the scale
of x1 in this direction, and the scale of x2 in that direction, are very different.

If you actually look at the numbers, this one is in the thousands and this one is of the
order of 10⁴ or 10⁵, so the range over which x2 spans is much higher. If we then divide each
of them by its standard deviation, you can see that the ranges become similar: roughly −2 to
2 for both the x1 and x2 parameters, after dividing by the standard deviation.

So why do we need to do this? If we are considering a neural network, let me look at just
two input neurons to show you what will happen; say the inputs are x1 and x2. The problem is
that terms like w1 x1 + w2 x2 will appear often, and if you are not normalizing, you will
see that x1 is much, much smaller than x2, by a factor of 10³.

During optimization, in order for the network to converge, the weights have to be adjusted
accordingly, so you would expect the weight multiplying the larger value to be much smaller,
by the same order of magnitude. It turns out that this kind of optimization is much harder
to do: typically it will not converge, or will be very slow to converge. Because the network
works only with the numbers, by inputting the raw values you are effectively saying that x2
is more important than x1, simply because it takes larger values.

Now, if you think all your features are equally important, then it is important to do this
normalization, because it brings them into the same range of numerical values and helps
convergence; optimization using gradient descent is easier this way. Because we have sums of
this form, which will be the input to your sigmoid function or other nonlinearity, the
larger term will tend to dominate all the time. To prevent that from happening, w2 would of
course be adjusted during optimization so that x2 does not dominate, but that is a difficult
problem for the optimizer.

If x1 and x2 are in the same range, the problem is much more easily solved. That is the
reason why we do this normalization. If all the data points are already in the same range,
you sometimes need not do it; but it also depends on what your output is like. For instance,
if you are producing a probabilistic output, say for a classification task, it makes sense
to squash all your inputs into the range 0 to 1, or to do this kind of normalization; the
same goes for regression. If your inputs are bounded, then you can have a bound on your
output as well, so it makes sense to do this normalization for many classification or
regression tasks.

(Refer Slide Time: 8:34)

The other technique we will look at is something called principal component
analysis (PCA), which can reduce the dimensionality of the data and also
decorrelates it. For instance, in the previous example you can think of the
smaller feature as the square footage and the larger variable as the price;
obviously these are correlated. There will be a lot of such features in your
data which might be correlated, and principal component analysis helps
decorrelate them as well.

We will look at this in a later lecture in the coming weeks, where we will
talk about what PCA is, and then I will show you how it can be used for
pre-processing data for machine learning algorithms. Of course, there is
other scaling we can do here: we did the z-score, wherein we subtracted the
mean and divided by the standard deviation, but you can also scale your data
to lie between -1 and 1 or between 0 and 1. Depending on your application,
this can be done; these are also valid pre-processing techniques. I would not
call them normalization, but they are pre-processing techniques that will
help you.
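As a rough sketch of these alternatives, the snippet below shows min-max
scaling to [0, 1] and PCA-based decorrelation on two correlated features.
Using scikit-learn's PCA is just one convenient choice here, not something
prescribed by the lecture, and the feature relationship is invented for
illustration:

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    # Two correlated features, e.g. square footage and price
    sqft = rng.normal(2000.0, 500.0, size=1000)
    price = 150.0 * sqft + rng.normal(0.0, 2e4, size=1000)
    X = np.stack([sqft, price], axis=1)

    # Min-max scaling to [0, 1] (an alternative to the z-score)
    X_01 = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

    # PCA decorrelates the features: off-diagonal covariance is near 0
    X_pca = PCA(n_components=2).fit_transform(X)
    print(np.round(np.cov(X_pca.T), 2))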

So what we have looked at is the most fundamental and most often used
pre-processing technique, and it is typically done for any kind of input; you
can do it for imaging inputs or any kind of real-valued inputs. Note that all
of this is for the input layer: this is your data, which goes in as input to
the first layer of your algorithm, in this case the deep learning algorithm.
So what happens in the intermediate layers?

Are these kinds of techniques applicable there too, and how do you go about
implementing them? This will be the topic of our next lecture, where we will
look at a widely used technique called batch normalization, which seems to
help quite a bit with faster training and convergence and gets rid of many of
these problems. One other thing I think I pointed out already: if we have
very large feature values, it is quite possible you will hit the saturation
of your nonlinearity, especially if you are using a sigmoid; that can easily
happen if the weights also blow up.

Say x2 gets a very high weight; then you will end up in saturation, which
means there will be no learning. That is typically why you want all your
inputs to be scaled within certain ranges. So how do we address this when it
happens in a hidden layer? That is a problem too, and we will look at batch
normalization, or batch norm as it is called, which acts as a layer in a deep
network, and see how it helps in faster convergence. That will be the topic
of our next video. Thank you.

Machine Learning for Engineering and Science Applications
Dr. Ganapathy Krishnamurthi
Department of Engineering Design
Indian Institute of Technology, Madras
Batch Normalization

Hello and welcome back. In this video, we will look at a technique called
batch normalisation, which helps in training a deep neural network better,
and you can treat it as a continuation of the data preprocessing that we saw
in the previous lecture.

(Refer Slide Time: 0:33)

First we will look at what batch normalisation is, and then we will consider
what the problem is when you are training a deep neural network. When we
train a network, the distribution of each layer's input changes during
training. We will see why that is in the next slide, but you can see that as
we train, because the weights keep changing, the input to a particular layer
in the network keeps changing dynamically.

If the weights change drastically between 2 iterations, you have the same
issue. The solution is to somehow make sure that the distributions do not
change too much, in the sense that the distributions of the inputs to each
layer do not change too much. So let us see what we mean by that and how we
can address the problem.

(Refer Slide Time: 1:26)

Let us consider this in a slightly more functional form. Say F1 and F2 are
some transformations. What is the transform that happens in a layer of a
neural network? In a layer, you have w^T x + b, where x is the input to that
layer; this is one transform that happens. Then you pass it through a
nonlinearity, so you get something like 1 / (1 + e^{-(w^T x + b)}).

That is the transformation that happens. When w changes with every iteration,
you can see that the inputs to the particular layer will also change
dramatically. So you can think of F1 and F2 as the transformations that
happen to your inputs at every layer. Say you have two layers in succession,
layer 1 characterised by parameters θ1, which are nothing but the weights,
and layer 2 characterised by another set of weights, θ2.

Now θ1 and θ2 keep changing: if θ1 changes, then F1 will change, so the input
to F2 changes. And if θ2 changes, then F2 itself will change, so its output
changes again. This change is of course expected to happen, because we are
trying to estimate θ1 and θ2, if we think of F1 and F2 as the layers in a
deep neural network.

But since these changes are sometimes random and large, you have problems
with convergence in a deep neural network. So what we do to address this
problem is to normalise each activation, typically before we apply the
nonlinearity. To clarify the notation: we always use x to denote the input;
in general, x is what we usually use for the training data input.

For the purposes of this video, think of x as the input to every layer. Every
layer has a set of neurons, and every neuron has an input coming into it.
What is that input? The input coming into every neuron is w^T x + b, where x
is the output from the previous layer. So if there are k neurons in a layer,
there will be k such terms, or k inputs. This is the input that is coming
into a layer, and (with some abuse of notation, since I have used x pretty
much for everything) what we do is update x as follows:

    x → (x − E[x]) / √Var(x)

This is what we saw earlier; it is your typical z-score normalisation. But
how do we calculate the expectation of x? What is this expectation of x? We
will just go through the algorithm, and then it will become very clear how
this mean of x is calculated for every layer.

(Refer Slide Time: 5:04)

So here is the algorithm. We are considering one layer, and let us say it has
k neurons; we are just trying to see how this batch normalising transform can
be applied to it. If you have k neurons, say k is 4, then we have inputs
coming in from the previous layer. I am not going to draw all of it because
it gets too confusing.

Multiple inputs come in from the previous layer, and then of course there is
the affine transform that we do, w^T x + b, followed by the nonlinearity;
that is the output of that particular layer, your activation. Once we have
that, how do we calculate the statistics? Let us just take one neuron. We
have all these inputs coming into that neuron, with which you can calculate
the linear combination w^T x + b.

Then you apply the nonlinearity to that; that is the output. Now, what do we
mean by calculating the mean of x? This is one input to that particular
neuron. What we do is consider a mini batch of m training samples. In the
forward pass, we can pass all m samples through in succession, and we can
compute this w^T x + b for each sample in the mini batch.

If we have m data points, then we will have m such calculated values for each
neuron. That is, for each activation, prior to passing it through the
nonlinearity, you will have m values corresponding to the data points in your
mini batch. This mean is calculated over the mini batch, for a single neuron
with m input data points.

Say this neuron is in the first or second layer. When you do the forward pass
for that neuron, using the weights that have already been estimated or
randomly initialised, for every one of the m points in the mini batch you
will get one linear combination. So we will have m such linear combinations,
with which you estimate a mean, and of course once you subtract that mean
out, you can estimate the mini-batch variance, the square of the standard
deviation.

You will normalise the activation of every neuron with these statistics: for
i equal to 1 to m, over the individual input training samples of that mini
batch, you calculate the normalised data point. Once you have done that, we
define 2 parameters, γ and β, again for every neuron; if there are 4 neurons
here, there will be γ1, β1, γ2, β2, γ3, β3 and γ4, β4. So every neuron has
two learnable parameters, γ and β, and you do this transformation,
y_i = γ x̂_i + β.

So how are γ and β estimated? They are estimated through backprop.


You can think of all this as a linear layer in the network, and that is how
it is typically interpreted. This layer is inserted between your affine
transform, which is the linear combination of the neuron activations from the
previous layer, and the nonlinearity you apply; it is in between these 2
operations that you have the batch normalisation layer.

What this transformation does is make sure that the distribution of the
activations you compute for every neuron does not shift too drastically; they
are confined to lie within a certain distribution. When this happens,
training is automatically faster and the network converges faster.

One example where this helps is when one of the w's is too large, which might
lead to saturation; we talked about that earlier. By doing this
normalisation, you can prevent that from happening. You can also see that
this is an invertible transformation: γ and β can be estimated so that y_i is
just equal to x_i. That is very easy to see, and I urge you to convince
yourself of it.

If the original calculated value is the one that is actually desirable, the
network would estimate γ and β to be the inverse of the normalisation we did,
leading to the identity transformation. This might sound cryptic, but you
should read the paper; I will post it, and I urge you to read it. Just to
recap once again: we are talking about a fully connected neural network, an
MLP, and every neuron in a particular layer gets inputs from the previous
layer.

We denote those inputs by x, and we are considering only one neuron at a
time. For every neuron, there is a linear combination of the activations from
the previous layer; that is what we call w^T x + b. When you are doing
training, there is a mini batch of m data points. We do the forward pass,
calculate w^T x + b for every data point in that mini batch, and compute a
mean and variance over that mini batch for that particular neuron.

Then we normalise the activation of that neuron for each input data point
using the calculated mean and standard deviation, and of course we multiply
by γ and add β to get the transformed variable. This γ and β are again
estimated by backpropagation. Remember that if there are k neurons in a
hidden layer, you will have k such pairs of parameters (γ, β), so it is the
addition of 2k parameters.
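To make the recap concrete, here is a minimal NumPy sketch of the batch-norm
forward pass during training for one layer, following the steps just
described: mini-batch mean and variance of the pre-activations,
normalisation, then the scale-and-shift with γ and β. The shapes and the
small epsilon (added for numerical safety) are illustrative assumptions, and
backpropagation through γ and β is not shown:

    import numpy as np

    def batchnorm_forward(z, gamma, beta, eps=1e-5):
        # z     : (m, k) array of w^T x + b for m samples and k neurons
        # gamma : (k,) scale parameters, learned by backprop
        # beta  : (k,) shift parameters, learned by backprop
        mu = z.mean(axis=0)                    # mini-batch mean, one per neuron
        var = z.var(axis=0)                    # mini-batch variance, one per neuron
        z_hat = (z - mu) / np.sqrt(var + eps)  # normalise each neuron's activation
        y = gamma * z_hat + beta               # scale and shift (the learnable part)
        return y, mu, var

    # Example: mini batch of m = 8 samples, layer with k = 4 neurons
    rng = np.random.default_rng(0)
    z = rng.normal(size=(8, 4)) * 50.0         # deliberately badly scaled
    y, mu, var = batchnorm_forward(z, gamma=np.ones(4), beta=np.zeros(4))
    print(y.mean(axis=0), y.std(axis=0))       # roughly 0 mean, 1 std per neuron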

(Refer Slide Time: 12:23)

So that is how it is trained: for every layer you will have γ and β,
estimated during training by backpropagation. Now once you have done
training, how do you do testing and inference? For testing and inference, you
still need to calculate this µ and σ; you have the γ and β, but you still
have to calculate the µ and σ.

What you do for that is compute µ and σ over the entire training set, for
every neuron in every layer; that is possible, since you have already
converged on the appropriate values of γ and β. There are also ways of
computing µ and σ as a running average during training, including
exponentially weighted average schemes. Either way, at test time you use a
fixed µ and σ for the activations of every neuron in every layer, calculated
by running the entire training dataset through a forward pass.

That is added computation, or you can just maintain a running average during
training; either of these is fine.
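As a rough sketch of the running-average option and of how the fixed
statistics are then used at test time; the exponential-moving-average scheme
and the momentum value 0.9 are assumptions for illustration, not something
fixed by the lecture:

    import numpy as np

    class BatchNormInference:
        # Tracks running statistics during training; uses them at test time.
        def __init__(self, k, momentum=0.9, eps=1e-5):
            self.mu = np.zeros(k)
            self.var = np.ones(k)
            self.momentum, self.eps = momentum, eps

        def update(self, batch_mu, batch_var):
            # Called once per mini batch during training
            self.mu = self.momentum * self.mu + (1 - self.momentum) * batch_mu
            self.var = self.momentum * self.var + (1 - self.momentum) * batch_var

        def forward(self, z, gamma, beta):
            # Test-time normalisation with the fixed (not per-batch) statistics
            z_hat = (z - self.mu) / np.sqrt(self.var + self.eps)
            return gamma * z_hat + beta

The alternative mentioned above, a full pass over the training set after
convergence, gives the same kind of fixed µ and σ at the cost of extra
computation.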

(Refer Slide Time: 13:34)

One of the advantages, as the authors of this particular paper comment, is
that it allows an increased learning rate: you can now train with a high
learning rate, leading to fast convergence. Normally with high learning rates
you get large updates and sometimes run into saturation; that will not happen
here, because you are constraining your activation values to lie within a
range, and that helps.

It can also let you remove dropout: the regularisation effects, the
advantages of dropout, are apparently also carried over by batch
normalisation. It improves stability during training for the same reason;
sometimes your activations or your weights can become very large, leading to
poor training, and that is taken care of by batch normalisation. There is an
extra computation burden, because an extra layer is added before every set of
neurons.

And you need to have a significant batch size. For some large problems, if
your dataset is large, memory constraints might force you to use a batch size
of 1, 2 or 3, and in that case there will be little benefit; the statistical
effects are lost. So there is no point doing this for problems where you have
very small batches, but for reasonably large batch sizes this will work.

One question we have not addressed is the convolutional neural network: how
do we do this in a convolutional neural network? It is a very interesting
question, and the paper actually addresses it; it talks about how to
calculate the statistics there. That will be a homework question, so I have
given you the homework already. I will upload the paper soon for you to read.

Read the paper; inside it, they comment on how batch normalisation can be
implemented in a convolutional network. Remember that what is described in
this video is how it is implemented for a fully connected neural network.
That is all for batch normalisation. We wanted to do these 2 videos together,
data normalisation as well as batch normalisation, because I think it helps
you understand this better. Thank you.

Machine Learning for Engineering and Science Applications
Dr. Balaji Srinivasan
Department of Engineering Design
Indian Institute of Technology, Madras
Introduction to RNNs

(Refer Slide Time: 0:14)

Welcome back. From this video on, for the next few videos, we are in the
final module of the deep learning series that we have been doing so far. Next
week, we will do some conventional, non-deep-learning machine learning
techniques. This final module is on what are called recurrent neural
networks. So far, you have seen a couple of main techniques for deep
learning. One is artificial neural networks; very heuristically speaking,
these are good whenever we are dealing with purely numerical, number-like
data.

Again, I am saying this simply in the context of engineering or science
problems. For example, if you have temperature and pressure somewhere and you
want the density for a specific application, and you think that these are the
only 2 variables in play, you could use artificial neural networks to predict
something of that sort. Once again, in week 10 or so, we will see a whole
variety of applications where this distinction will become clear. So that is
artificial neural networks.

(Refer Slide Time: 1:35)

You also saw, this of course is AlexNet, convolutional neural networks,
CNNs. These are primarily for image-like data; when you have data in the form
of pictures, we use CNNs. Once again, we will see applications of this. Dr
Ganapathy has shown you a few applications already in medical image analysis,
and we will show you a few more in proper engineering or science problems a
couple of weeks from now. We also saw that CNNs are essentially a special
case of artificial neural networks. Now, there is a third class of data, or a
third class of problems, which is what RNNs deal with best.

(Refer Slide Time: 2:24)

Now ANNs and CNNs have a couple of lacunae, a few problems. One problem is
that your inputs are of fixed size. For example, if I use temperature and
pressure, that is what I am forced to use each time. There is a subtle point
here in what fixed size means; we will come to that when we come to RNNs,
both in this video and in the next. More importantly, you need the whole
input to be available simultaneously: the input, for example the image here,
has to be given in one shot.

You cannot later decide to give another image: that image is given, and you
get an output. Now compare that with something like making closed captions
for what I am speaking. Suppose you have an online translator, so that as I
am speaking, somebody is writing down what I am saying, or Alexa interprets
your inputs as you speak, or your car does things as you speak.

(Refer Slide Time: 3:34)

Something of that sort requires sequential processing. Some of the typical
problems that are solved with RNNs (recurrent neural networks) involve speech
processing; like I said, Amazon Echo, etc. will be using an RNN somewhere
inside, so speech recognition is there. Language translation, for example
Google Translate, uses some form of RNN; video analysis, and so on. All these
problems where sequence matters typically involve recurrent neural networks.

The key ideas which are important in deciding whether an RNN is an
appropriate model to use are variable-sized input and sequential information.
Even if it is not quite clear yet what variable-sized input means, it will
become clear by the end of this video and the next. Sequential information is
quite important; in engineering speak, it typically means time-like or
time-series-like data. Of course, RNNs are also used, and currently most
popular, in language.

Or in what is called natural language processing. This involves series of
words: like I said, translating from one language to another, as Google
Translate does. All those applications use RNNs because words come in a
sequence, and sentences are not necessarily of a fixed size. You can have a
short sentence or a long sentence, and you have to translate sentence by
sentence.

We will not really be discussing language applications; that is not the
purpose of this course. We are basically only going to discuss the overall
ideas behind RNNs and some of the architectures. Just as you saw a few
architectures for CNNs, we will discuss a few architectures within RNNs as
well, and we will also show you what kind of tweaks you require to apply them
to engineering problems, specifically in week 10, but I will show you one
example this week also.

(Refer Slide Time: 5:59)

Andrej Karpathy has written a very nice blog post; there is also a course
called CS231n, on convolutional neural networks, which discusses the basics
of RNNs as well as CNNs. This idea is actually borrowed from Andrej
Karpathy's very nice blog post, which we will post on our website also. It is
called "The Unreasonable Effectiveness of Recurrent Neural Networks". I would
recommend that all of you take a look at it; it has some very nice language
examples, though we will not be using too many language examples within this
course.

If you are interested, you can take a look at that; he has also posted his
own Python code on GitHub there. Now, this is not a very standard
classification, but Karpathy has classified the types of RNN architectures
that you will see into 5: one to one, one to many, many to one, and 2 kinds
of many to many. Once again, this is not really a classification that exists
in the literature, but I think it is a great way of classifying them for
introductory purposes.

So we will look at that. Let us look at some examples of where RNNs are used,
once again very notionally; we will go into depth a little later, in the next
few videos.

(Refer Slide Time: 7:43)

Here, this red block is simply our usual input, let us call it x; the green
block in the middle is our hidden layer; and the blue block is our output
layer. So as usual: input, hidden, output. Now this one-to-one case is
basically our usual ANN. You could even think of it as a CNN, and you can
think of this h as if it were many layers, but let us assume it is one. So it
would not be a particularly good CNN; let us assume this is simply an ANN
with one single hidden layer and an output layer.

Whether it is an RNN, ANN or CNN, as long as there is only one input and one
output, you can basically say that all 3 are equivalent. So this would be
what is called a vanilla RNN or simple RNN structure. Up to this point, there
is nothing impressive about an RNN. Let me show some results; I must point
out that I got these results off the web a little while back and have
forgotten where I got them from, so if somebody can point it out, I will put
up an explicit acknowledgement on the website. Please let me know.

This one is a one-to-many classification, and this is where RNNs start
differing from ANNs and CNNs. What is happening here? You have a single input
and a bunch of outputs. What is that like? It is as if you have a single
image as the input to your problem, and the output is a bunch of words. How
does this magic happen? We will see a little bit of it in the next few
videos, but the basic idea is very simple; as usual, it is just a mapping.

You somehow have to map one input to a bunch of outputs. What each of these
outputs means, we will see shortly, but let us assume they correspond to
word1, word2, word3. You can see that just from seeing the image, there has
been a fantastic output here saying "a dog is running in the grass with a
Frisbee". That is remarkable. Of course, it is not simply a one-to-many RNN
as I have shown here; there are many other tweaks going on, and it is well
beyond the scope of this course to go into every detail.

Hopefully at the end of this course you will be able to read papers like
this, image captioning papers, and figure out what is actually going on.
Similarly, there is a question asked based on a figure, and the network is
supposed to figure out which of the choices is correct, for instance the fact
that this picture was taken during a wedding. As you can probably tell from
what we have talked about so far, a lot of it depends on how you train;
training is a very, very important part of how this kind of output can be
obtained. This was shown just to give you a simple example that there can be
one input and many, many outputs.

(Refer Slide Time: 10:55)

Then you could have the opposite case, which is many to one: a whole bunch of
inputs and one single output. What would that be like? That is like giving a
sequence of words and trying to find out whether the sentiment is positive or
negative; this is called sentiment analysis. If you read somebody's feedback
report, written in words, you simply want to figure out: is this positive
feedback or negative feedback?

Some students at IIT Madras actually did nice work analysing Twitter feeds
about many companies' stocks and trying to figure out whether the market
sentiment was positive or negative; this can be done automatically. Once
again, if you use an RNN structure for this, it would be a many-to-one
structure: a whole bunch of words with one output, is this positive or
negative? You can do several things with this. The third kind of task would
be many to many, and this itself splits into 2.

One kind of many to many is such that you have a whole bunch of inputs and
lots of outputs, but the outputs need not come simultaneously with the
inputs; further, the outputs need not have the same size as the inputs. For
example, if I translate one language to another, suppose whatever I am
speaking is changed into Tamil, the number of words in my English speech need
not be the same as the number of words in my Tamil speech. Similarly, that is
how Google Translate works; it is not as if it is a one-to-one map.

Some bunch of words is given, and after that some other bunch of words comes
out as the output; that would be a language translation task. Similarly,
speech recognition. Why would that be a many-to-many task? When I speak,
there is the length of my audio signal, and the length of my audio signal
need not be the same as the number of words that I am using. So this is also
a many-to-many task where the input size need not, and generally will not, be
the same as the output size.

(Refer Slide Time: 13:17)


The second kind of many to many is one where for every input you have a
corresponding output, so at least you have an equal number of inputs and
outputs. All of these are, I would say, traditional examples of what RNNs are
used for.

(Refer Slide Time: 14:07)

Our interest of course is in engineering examples, but before that, let me
show you the rough structure of an RNN; so far I have just shown it as a box.
Let us get into slightly more detail within this video about what is done.
The general idea is variably sized, sequential data. What does variably sized
mean? It means that even though the number of features is fixed, the length
of the data is not.

This might be slightly confusing; it is best explained in terms of a couple
of examples, which we will see later on, so just keep it at the back of your
mind. We will see this both in this video and in the next. Now what is the
basic idea of an RNN? The same as most ANNs: you take an input vector and you
produce an output vector, except in this case you have a slightly different
architecture.

It works like this; you will see it in the pictures that I have drawn so far.
You have one input vector here, an output vector, a hidden vector here, and
another hidden vector here, and so on. Now this hidden vector, say h2, is a
function of 2 things: it is a function of x2, and it is also a function of
h1. You will see in the next video how this can be tremendously useful; in
fact, later in this video itself I will talk about it.

So this is a function of 2 variables, x2 as well as h1. Many people find the
usual representation very confusing: x goes into the RNN box and it loops
into itself; you will understand why this is so very shortly. But most of us,
including Professor Andrew, do not prefer this kind of picture; we prefer the
picture called an unrolled RNN. I would recommend that at least for thinking
purposes you think of it this way, but if somebody uses the looped picture,
you should be able to understand it.

I will show this a little more in this video. So remember: you have one
hidden vector, and it depends on the previous hidden vector as well as the
input vector at that particular time. This axis, even though I will sometimes
abuse notation and call it the number of layers, is actually a stand-in for
time.

(Refer Slide Time: 17:21)

As I keep moving along here, in all these pictures, you can think of it as
moving in time.

(Refer Slide Time: 17:30)

Here is some theoretical detail; I will discuss this and then come back to
how the function works. For those from a computer science background:
remember that we had the universal approximation theorem for artificial
neural networks, which said that I can take a large enough ANN and
approximate any function that I would like. RNNs have a similar property, in
that they can simulate, or come close to approximating, any program.

Any computer program can in a sense be simulated, because remember, for a
computer program all your inputs might not be available simultaneously. You
might give some data now, and after some time the computer asks for something
else and you give new data at each point. So RNNs are supposed to simulate a
generic function which can move forward in time and take inputs at varying
time instants.

That is the speciality of RNNs. Once again, theoretically we know that RNNs
are what is called Turing complete, that is, they can simulate any program.
After this brief theoretical note, we will move on to practical matters.

(Refer Slide Time: 18:57)

Let us come to the structure of an RNN. Let us take a simple example and
explain it via a diagram. So remember the structure.

(Refer Slide Time: 19:31)

Let me show you a problem, and we will come back to this picture. Say I want
to find out today's temperature. I go on Google, and indeed Google gives me
today's temperature, or yesterday's as it might be. Not only does it give
that, it gives you a few other things: tomorrow, the day after, the day after
that, and so on. Now, how does it do it? Of course, they are not using RNNs
here; I want to be very clear that even though I am using weather or
temperature prediction as an example, it is a very simple example with which
we can understand RNN ideas, but I do not think Google is actually using any
such thing.

It is probably using very traditional tools; RNNs have not yet become good
enough to actually do weather prediction, as far as my knowledge goes.

(Refer Slide Time: 20:24)

But let us come back to this picture. Say I make a simple prediction; I will
erase this for a short while and then reintroduce it. Say I have today's
weather: this x denotes, let us say, today's temperature, and this is my RNN
structure. This here will predict, say, tomorrow's temperature. To be a
little more realistic, apart from today's temperature I also give today's
pressure, today's humidity and today's rainfall; all these are my inputs.

Say I am taking 4 features, and tomorrow's temperature is my sole output. So
ŷ is a scalar, 1 × 1, and x is a vector, 4 × 1. Say I have a very simple
neural network trained in the following way: you give today's temperature,
pressure at a particular place or height, humidity and rainfall, and then you
predict tomorrow's temperature. This would be a one-to-one prediction, and
that is just an ANN; it is not an RNN at all.

But suppose we want a different task: not only do I want to predict Saturday,
I also want to predict Sunday, from just today's temperature, pressure, etc.,
and nothing else. Now, can we exploit something? The interval between Friday
and Saturday is the same as the interval between Saturday and Sunday. So if
this output was some function of x, the next output should have a somewhat
similar relationship to this output.

That is kind of encoded within an RNN. What do you do? You recognise that the
relationship here is similar to the relationship there; from Sunday to Monday
that relationship is also similar, then again Monday to Tuesday, and so on,
which is what is being done. Of course, in practice, if you have a small
error at some place it will propagate and grow, but this is the basic idea
behind an RNN: you try to incorporate within it the idea of equally spaced,
repetitive temporal relationships, that is, you want the same relationship
between h2 and h1 as between h3 and h2, as between h4 and h3, and so on.

This is a very simple idea; when I write it as a formula, you will see that
maybe all this explanation was unnecessary. Before I go forward: even though
you can keep this explanation at the back of your mind for intuition, it is
not necessary that deep learning researchers will agree with me that this is
the basic intuition behind RNNs. It works very well for scientific and
engineering problems. For language problems, it is actually quite amazing
that RNNs work, but they do because of the Turing-complete property.

So please keep this at the back of your mind; I am saying this just to help
you build intuition, especially if you are from an engineering or science
background.

(Refer Slide Time: 24:39)

Now that I have shown this picture: I have h1 here, x1 here, and say ŷ1 here.
I also have h2 here. Remember, I have no corresponding x2; x2 would be the
temperature, pressure, humidity and rainfall tomorrow, which I do not have,
because I am asking for all this data today. I put in the search on Friday,
and on Friday I want the results for Saturday, Sunday, Monday, Tuesday,
Wednesday, etc. So I cannot really get x2, but x's information is somewhat
encoded within h.

Even though it is after a nonlinearity, it is encoded within h. So I want to
extract from that h what the temperature tomorrow could possibly be.
Similarly, I will reuse this h to try to predict the temperature the day
after tomorrow. How do we do this mathematically? Somewhat anticlimactically,
it is very, very simple.

(Refer Slide Time: 25:46)

Mathematically, we will write it this way. In any box, the following goes on:
you have h_t; sometimes you have x_t and sometimes you do not; you have
h_{t−1} coming from the previous instant of time; and you will see later that
sometimes you take out an output and sometimes you do not. Now h_t, as I
wrote before, is a function (we will call it variously f_w, g, etc.) of
h_{t−1} and x_t. What is the most general function we typically use within a
neural network?

It is very simple: we take a linear combination followed by a nonlinearity,
always. Typically in RNNs we use tanh for the nonlinearity in the hidden
layers. So in this case this will be tanh, and we need a linear combination
of h and x. There will be some weight matrix multiplying h and some other
weight matrix multiplying x. These 2 weight matrices will in general be
different; not only that, they will also have different sizes, as you will
see shortly.

The weight matrix multiplying h we will call W_hh, because it takes an h and
gives out an h, and the weight matrix multiplying x we will call W_xh,
because it takes an x and gives out an h. So the general formula for the
hidden layer of an RNN is h_t = tanh(W_hh h_{t−1} + W_xh x_t); some people
will replace this tanh by f_w or by g. Now what about ŷ_t? ŷ_t is equal to
some function of h_t. In some cases it simply makes sense for this function
to be linear; in other cases it makes sense for it to be nonlinear.
It also depends on whether you are doing a regression task or a
classification task. If it is a binary classification task, then g will
become a σ; if it is a multiclass classification task, you will use a
softmax. I will show you one example in the next video where we have a fully
connected layer with a nonlinearity, followed by a softmax. So it can be
linear or nonlinear, depending on what you want to do.

In something like the task we have chosen, if it is a single output, we would
probably make this a simple σ followed by a linear layer, something of that
sort.
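Putting these two formulas together, here is a minimal NumPy sketch of one
RNN step with a linear read-out. The hidden size of 100 and the 4-feature
weather input follow the example above, but all shapes, the random
initialisation and the linear output head are illustrative assumptions:

    import numpy as np

    n_h, n_x = 100, 4                    # hidden size (arbitrary), input features

    rng = np.random.default_rng(0)
    W_hh = rng.normal(scale=0.01, size=(n_h, n_h))  # h -> h weights
    W_xh = rng.normal(scale=0.01, size=(n_h, n_x))  # x -> h weights
    b = np.zeros(n_h)
    W_hy = rng.normal(scale=0.01, size=(1, n_h))    # h -> scalar output

    def rnn_step(h_prev, x_t):
        # h_t = tanh(W_hh h_{t-1} + W_xh x_t + b)
        return np.tanh(W_hh @ h_prev + W_xh @ x_t + b)

    def output(h_t):
        # A simple linear read-out; could be a sigma or softmax for classification
        return (W_hy @ h_t).item()

    h = np.zeros(n_h)                    # initial hidden state
    x0 = np.array([30.0, 1.0, 0.6, 0.0]) # today's temperature, pressure, humidity, rainfall
    h = rnn_step(h, x0)
    print(output(h))                     # prediction for tomorrow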

(Refer Slide Time: 29:08)

Let us make some further modifications to this basic example. The example I
started with was: I have x, which goes into h1, which, depending on how far I
want to predict into the future, goes into h2, h3, etc. Let me erase this, I
will come back to it later, and take out temperatures at each step. This is a
classic one-to-many case, the case that we just saw. Let me put in some
numbers for a little clarity: say x is 4 × 1, and we already know that y1 is
1 × 1, since I am only predicting the temperature.

x took in more things; if you find pressure strange, you can use wind speed
or something of that sort. So say we have 4 features that we take as input,
and we are predicting tomorrow's temperature, the day after tomorrow's
temperature, and so on. How many? As many as we want. Here itself you see the
variation in sequence length: not only can the input be of varied length, the
output can also be of varied length, because you can predict forever, as
against an ANN or a CNN whose output size, whether it predicts accurately or
not, is fixed.

Here the output size is variable; that is another property of RNNs. Now let
us see how the input size can also be variable. Say I made this prediction: I
took today's temperature, wind speed or pressure, humidity, and let us say
precipitation, these 4, and I predicted tomorrow's temperature. Now tomorrow
came, and I actually remeasured this data: that day's temperature, pressure,
humidity and rainfall.

Why should I make just this one independent prediction? I would like to reuse
the old data: yesterday's data tells me something, and today's data also
tells me something. So instead of making this an independent ANN, I
reincorporate the old data through the hidden layer input, and this actually
becomes an RNN. Now I can use this and get slightly modified outputs. If I
include this, it becomes a many-to-many prediction.

A many-to-many prediction could be: I have maybe 2-3 days of data and I
predict 2-3 days ahead. In fact, recently Google made a claim (they have not
put the full paper online as yet) that they have been able to do wind farm
usage prediction, which would be an example of RNN prediction. This is fairly
recent, about a week ago, which would make it around March 8 or 9 of this
year. They were able to predict how much their load would go up or come down,
and that would almost certainly be an RNN prediction.

What would they use? They would use historical data of all the usage so far.
So you can see you will have a variable-size input: as and when data keeps
coming in, you can keep adding this new data and your answers will change. A
good example of this is when you start typing search terms into Google, or
nowadays even within Gmail: as you change what you are typing, its prediction
of what you are doing changes.

In fact, even cellphones do that. I am not sure which of these use an RNN,
but I am fairly certain that Google is currently using some version of an RNN
there, in combination with traditional AI techniques. That would be an
example of a variable-size input: as you give more, it gives you a better or
different prediction. You cannot do this with traditional ANNs or CNNs. So an
example of a many-to-many task would be: I give many days of data, and you
give out a prediction of many days of future temperature. That would be many
to many.

I would recommend that you try looking at all the classifications using this
temperature example, and you will see that it fits very well. You will be
able to find an analogy, which is good for people from an engineering or
science background, because we are not going to do a language task in this
module, at least in this course. Now, before I end this video, I want to give
you some idea of the numbers that actually exist in this.

Let us take h2. h2 will be tanh(W_hh h1 + W_xh x2 + b); in general we add a
bias unit, and I will talk about this in the next video. Now you need to
think about the size of this h. x was of size 4 × 1, y was of size 1 × 1.
What about the size of h? The size of h is like the size of any hidden layer
of a neural network: it is arbitrary. You can give 100 neurons or 200
neurons, however many you want; obviously you want to give a small number of
neurons while making sure that your prediction is fairly good.

One question I am pretty certain will come up, and it would have come up
naturally even with ANNs or CNNs: what does h mean, or what does the data in
h mean? We ascribe some meaning to x, which was temperature, humidity,
rainfall and pressure; y was tomorrow's temperature. What does h mean?
Actually, we do not know. So what do we do? We simply remember that this is
just the forward model. In a forward model, we simply postulate, we guess, a
relationship between x and ŷ, and this is the relationship that we are
guessing.

We guess this relationship and just calculate what these w's are; that is all
we do. Whatever those w's turn out to be, h has no further meaning beyond
that. As we have said a few times during the course, trying to interpret this
is an open research problem. Nobody knows why this h should go to that h; all
we can say is that somehow the information behind x is being stored in h and
you want to reuse it. Since you do not explicitly know tomorrow's pressure,
humidity and rainfall, you guess that somehow that information is hidden in
h, and you reuse it.

One final, very important point, which is what gives RNNs their power: W_hh,
W_xh and b are constant with time. What does that mean? If I write down h3,
then h3 = tanh(W_hh h2 + b); you will notice that there is no x3 in our
example, in which case that term gets wiped out. This W_hh is the same W_hh
as before. So you do not change the weight matrices as you go forward in
time; you do not change them at all, because that would mean a tremendous
number of parameters for the RNN. (A rough sketch of this weight sharing
follows below.)
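Continuing the one-to-many temperature example, here is a rough
self-contained sketch of the weight sharing: the same W_hh is reused at every
step, and the W_xh x_t term drops out on future days when there is no input.
All sizes and initialisations are again illustrative assumptions:

    import numpy as np

    rng = np.random.default_rng(0)
    n_h, n_x = 100, 4
    W_hh = rng.normal(scale=0.01, size=(n_h, n_h))  # shared across ALL time steps
    W_xh = rng.normal(scale=0.01, size=(n_h, n_x))  # shared across ALL time steps
    b = np.zeros(n_h)
    W_hy = rng.normal(scale=0.01, size=(1, n_h))

    def forecast(x_today, n_days):
        # h1 = tanh(W_xh x_today + b): the only step with an actual input
        h = np.tanh(W_xh @ x_today + b)
        preds = [(W_hy @ h).item()]
        for _ in range(n_days - 1):
            # Future steps have no x_t, so the W_xh x_t term drops out;
            # the same W_hh is reused every step: that is the weight sharing.
            h = np.tanh(W_hh @ h + b)
            preds.append((W_hy @ h).item())
        return preds

    x_today = np.array([30.0, 1.0, 0.6, 0.0])  # temperature, pressure, humidity, rainfall
    print(forecast(x_today, n_days=5))         # say, Saturday through Wednesday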

The power of RNNs, the reason why they are useful, is that you can use one
W_hh and one W_xh and be done. That is what gives RNNs their power. In the
next video, we will see a short example, somewhat more detailed than this
temperature one, and we will show it to you in MATLAB. Hopefully things will
become a little clearer, including what variable-size sequential data means.
Thank you.

Machine Learning for Engineering and Science Applications
Dr. Balaji Srinivasan
Department of Engineering Design
Indian Institute of Technology, Madras
Example - Sequence to Sequence Classification

(Refer Slide Time: 0:15)

Welcome back. In the last video, we saw an introduction to RNNs; RNN,
remember, stands for recurrent neural network. I told you about 2 important
properties that RNNs have that CNNs and ANNs typically cannot handle: one was
variable length, and the second was sequential data. As I said a few times in
the last video, the basic purpose of RNNs is to be able to handle
variable-length sequential data. In this video, I will show you a short
example, again borrowed from MATLAB's examples, this time explicitly from
what MATLAB has given in their deep learning toolbox.

Hopefully, you will see specifically what we mean by variable length;
sequential data should already be intuitive to you. The purpose of this video
is not to introduce RNNs completely, but to introduce the use cases, how we
are going to use them, and hopefully give you some intuition about what we
are going to do. When we were looking at RNNs, we looked at various
classifications. These are not really official classifications, but they were
given by a very good researcher, Andrej Karpathy.

What we are going to look at in this example is a many-to-many
classification. There were 2 subtypes of many to many that we looked at;
again, this is Karpathy's way of doing it. Suppose you have various inputs
going through the sequence of hidden layers, and you have an output
corresponding to each of these inputs; this would be a many-to-many
classification. We will do an example of that sort shortly within this video.

I would like to draw an analogy between what you will see here, which is
called sequence-to-sequence classification, and what happened when you did
something like semantic segmentation when we were looking at CNNs. In CNNs,
we try to label each pixel as belonging to body 1, body 2 or body 3, as Dr
Ganapathy showed you. A sequence-to-sequence classification is somewhat
similar.

What does that mean? Suppose you have a time sequence, and a few possible
events could happen at each instant. I will give you one good example. Say
you are driving a car and you have your cellphone with you. Remember that
cellphones have accelerometers; if you do not know this, it is an interesting
thing. A cellphone has within it some way to figure out which way the
acceleration is pointing, and a simple reason for this is so that when you
rotate your cellphone, the picture can rotate too.

If you have a smartphone (this was first brought out by the iPhone, as you
may remember), it has this capability to rotate your picture, because it
needs to know which way you are accelerating, which way is up and which way
is sideways. So cellphones have these accelerometers. Say you keep a
smartphone in your car; from the acceleration signal, it should be able to
say whether you are stationary, moving, accelerating, or in an accident.

This can be particularly useful. I have heard of a student from IIT Delhi who
actually did this as a project; I am not sure whether he used machine
learning or not, but basically the idea was: suppose you have an acceleration
signal from your car and it suddenly changes rapidly, you, or at least the
cellphone, can figure out that this person has been in an accident.

A sequence-to-sequence classification for this would be: at each point it
will say "moving normally", and at this point it will say "in an accident".
In fact, what this person did, as I heard it, was based on the acceleration
signal of the cellphone: he was able to alert the local ambulances or police
authorities, because apparently the best chance of saving a life is in the
early moments, and by the time someone sees the accident and calls the police
or an ambulance, it is sometimes already too late.

But if the cellphone can figure this out and call immediately, then this is a
great use case. I am going to show a variation of this, given by MATLAB, also
based on accelerometer data from a phone in a person's pocket. Let us see
that case.

(Refer Slide Time: 5:31)

The example that I am going to show is from this link, which is available
within your online MATLAB account for the duration of this NPTEL course. I am
just going to introduce the case; I am not really going to show you the
training, just snippets of the code and of the description given by MATLAB
itself. I would highly recommend that you go to your online MATLAB account
and run this code. It is also available directly on MATLAB's website, the one
I have shown here.

Please take that code, run it, and see for yourself how it works. A few terms
will be used throughout this video; one of them is LSTM, a term I have not
explained so far. It is a type of RNN, the most popular RNN architecture
available today. The relationship between RNNs and LSTMs is somewhat like the
relationship between CNNs and a particular architecture, say LeNet or
AlexNet, the architectures you saw.

So LSTM is a certain type of RNN architecture, and that is what is used
within this example. Whenever you hear LSTM, simply think RNN.

(Refer Slide Time: 6:56)

Let us come to this example; this is from MATLAB's own description of the
problem. As it says, it uses an LSTM, which stands for long short-term
memory, as we will see later, and we will see the reason for the term too.
What we are going to build is a sequence-to-sequence RNN, or LSTM network.
Basically, it takes each time instant and predicts what the activity is; that
is the basic task here: time instant to activity prediction.

This can be tremendously useful; I told you one simple way of using it,
keeping an accelerometer or a smartphone within the car. But in this case we
take a toy example: say a person has their cellphone attached to their body,
and based on the signal, the cellphone is supposed to figure out whether the
person is sitting, standing, walking, dancing or running. These are the 5
classes that we are going to look for.

That is shown here. The accelerometer is not measuring acceleration in only
one direction; it is actually measuring acceleration in 3 different
directions. So the sensor gives you accelerations in 3 different directions,
and you measure this data for 7 different people; this is the training data.
As you can see, we are throwing up a lot of numbers; let us disentangle them
shortly.

Where does the variable length come in? Each sequence has 3 features: call
them acceleration in direction X, acceleration in direction Y and
acceleration in direction Z. These are the 3 features that the sensor throws
up at each time instant. But the sequences vary in length. Why? As you will
see shortly in a graph, what we are measuring is this: suppose I wear this
cellphone on my body or have it in my pocket, and I draw a graph against time
while I do some activity for a certain length of time.

My acceleration signal looks like this with time in X, like this in Y, and
like this in Z. Now suppose I wear this accelerometer for, say, one hour, and
suppose the signal is sampled 5 times a second. Then I will have
3600 × 5 = 18,000 data points; that is the length of the sequence for the
sensor I have worn.

Now some other person, say you, might not wear it for one hour but for 30
minutes; in that case, your sequence length will be 9,000. Somebody else
wears it for 2 hours, and that person's length is going to be 36,000. Unlike
CNNs, where we were doing simple padding, when you get more data and you want
a sequence-to-sequence prediction, it is sensible to take each one of these
sequences as-is and still give a prediction.

This is something that is not normally possible with ANNs or CNNs. Variable
length means your time length can actually be different for different people.
Say you are trying to transcribe a YouTube video, or Alexa, Amazon's Echo, or
Google is transcribing our speech; not all of us are going to speak for an
equal length of time. Each of us will have a different length.

Similarly, if you want to interpret or translate sentences, each of these sentences is of a different length. That is what is meant by variable length, okay. So this is an important property of an RNN: the length of the sequence can be different for different datasets, okay. Each dataset can have a different length, even though the number of features is the same, okay. So please distinguish between the two. When we say variable length, we mean the length of the time sequence over which this data is given.

That can be different for different samples; that is the advantage of RNNs, okay. And your final test data might be of an entirely different sequence length. So I could just give 30 seconds of data and ask what this person was doing, okay. It would be a very short sequence, but you should still be able to classify it, okay. The number of features, however, is always fixed. That is the case whether you have an ANN, a CNN or an RNN: the number of features is fixed, okay. So this is the problem that we are starting with.

We are given $a_x$ with respect to $t$, $a_y$ with respect to $t$ and $a_z$ with respect to $t$, and we want to predict, at each time instant, what this person was doing. Such a task is called a sequence to sequence classification task, okay.
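As an aside, here is a minimal sketch, in Python with NumPy (the lecture's own example is in MATLAB, so this is only an illustration), of how such a variable-length, fixed-feature dataset can be laid out in memory; the sequence lengths and the random values are hypothetical.

import numpy as np

# One (3, T_i) array per person: the feature count is fixed at 3 (ax, ay, az),
# but the sequence length T_i differs from person to person.
rng = np.random.default_rng(0)

seq_lengths = [18000, 9000, 36000]                        # hypothetical lengths
X = [rng.standard_normal((3, T)) for T in seq_lengths]    # 3 accelerations per instant
Y = [rng.integers(0, 5, size=T) for T in seq_lengths]     # one of 5 activity labels per instant

for x, y in zip(X, Y):
    assert x.shape[0] == 3           # number of features is fixed...
    assert x.shape[1] == y.shape[0]  # ...but length varies, with one label per time instant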

(Refer Slide Time: 13:04)

So the way MATLAB does this is it has its own dataset, and again, I recommend that you look at this dataset from the site yourself and maybe explore it a little bit. So it has a human activity training dataset. Remember, this has 7 people, 3 features ($a_x$, $a_y$, $a_z$) and variable length. An important property of the RNN, which I mentioned even in the last video, is that when we take different data points, the structure of the RNN is such that these have to be equally spaced in time, okay. For engineering problems, I would recommend this very strongly.

For engineering problems, when you are genuinely dealing with time, as against, let us say, a language task (where one word following another is not really a time problem), that is, when time is one of the variables with respect to which you are running the RNN, please try to use datasets which are equally spaced in time. This is an important notion that you should be aware of, okay.

So here we will assume that the sensor took its data at equal time instants, okay. If you do not do this, it will be much harder to train because of certain features of the RNN which I will point out, okay. So let us say we load this and we print what this X train is.

(Refer Slide Time: 15:00)

So you can see that this training set has 6 people. I had said 7 people, but we are basically keeping one for testing. It is a small dataset; you cannot really do that 60-20-20 split, etc. So you take 6 people and try to predict for one, okay. So this training set is simply 6 people. Now notice, each of them has 3 features. So this is example 1, example 2, example 3, up till example 6, okay. But each of them has a different sequence length.

So person 1 actually measured, or continued, their activity for a much longer time, because it goes on for 64,480 time steps, whereas this person, at 50,688, is probably the shortest in this dataset, okay. So this is to say that you can train even though each person actually gave you data for a different length of time, okay. That is the advantage of an RNN, okay.
okay.

(Refer Slide Time: 16:25)

Let us now visualise the data. We will just take person 1 and take one of the directions, one of the sensors in the accelerometer, and plot just that, okay. So that is what is plotted here. You can see the time step axis goes to approximately $6.4 \times 10^4$, so 64,000 data points. And what we have plotted here is the signal of the accelerometer against time. Now, as should be fairly obvious, when the person is sitting, you do not see very much acceleration.

Maybe this person, like I am doing now, moved around a little bit, okay. Now this dataset is labelled: we know the activity this person is doing. This person is sitting, and at each point in this stretch you say that the person was sitting. I will show you how to represent this in an RNN structure shortly, okay. From this point to this point, the person got up, and you can see there is a rapid change in acceleration, but as of now, we have still labelled it.

The label we gave was that of simply sitting. We will see the effect this has on the test set in a later video. Now at this point, this person is simply standing; that has its label here, and you can see that even here the variation is not much. But also understand that, somehow, the network is supposed to tell these two activities apart based on signals like these. Now when the person is walking, you can see a lot of variation in the person's acceleration, okay.

So this portion is walking, then the person starts running and finally the person is dancing, okay. So this is the dataset that is given to you. But just based on acceleration in one direction, you cannot figure it out; you have to see acceleration in all 3 directions. When you are dancing, maybe you will jump around a whole lot more and your acceleration in Z will be larger, okay. Similarly in running, okay. In walking, it will be less intense in some direction or the other.

So it is from the combination of these 3 acceleration signals that your RNN is supposed to figure out whether this person is dancing, walking, running, etc. Overall, the important thing here is to see that there are 5 classes and 3 features, okay. Of course, we know that the 3 features correspond to the inputs and the 5 classes correspond to the output. Let us now use this to create an RNN architecture, or within RNNs we are basically creating an LSTM architecture. So we will create that and I will explain how this gels with what I showed you in the last video, and hopefully things will be much clearer at that point.

(Refer Slide Time: 19:24)

So as I said just before, our input sequence has 3 features and it has 5 classes as output, okay. So suppose I draw that in a figure, just a single point. If I write $x$, what comes out is $y$ or $\hat{y}$. $\hat{y}$ is supposed to have size 5, and $x$ is supposed to have size 3. Now why did I write $3 \times 1$ and not $3 \times 64{,}000$ or something? Because this is only the first time instant. Okay, remember, this is a sequence to sequence classification task. I take the first time instant, and somehow I am supposed to predict $\hat{y}_1$, okay.

Now the magic ingredient is the layer in the middle; this is what we typically call $h$, $h$ for hidden layer, which is the terminology that we have used within neural networks right from the beginning. Now what is the size, or the number of neurons, in the hidden layer? I will call this $h_1$. This is, once again, like we were doing before, our choice, okay. The only thing is, once we choose it, we will keep it fixed through the time sequence, okay.

In this case, the MATLAB example has chosen 100 units, but you can choose any number; obviously if you use a smaller number, you will have less expressibility, but you will train faster, okay. So we are going to choose 100 units for this particular example. So effectively, you can see this as 3 input neurons, 100 hidden neurons and 5 output neurons. This is a neural network, except turned on its side: we usually draw a neural network going one way, and here we have it going the other way, okay. Now what is an RNN?

(Refer Slide Time: 21:55)

An RNN is multiple such neural networks put together, okay. So I start with $x_1$, have $h_1$, get $\hat{y}_1$. But this $h_1$ is not wasted. It goes to another $h_2$, which also takes an input from $x_2$. Okay, so please notice what is happening here and see what, in some sense, makes RNNs work. What is it? Whatever features this hidden layer had are not wasted; they are reused in the 2nd time instant, okay, because you want to know what happened before. You do not want to forget that. We are not taking $\hat{y}_1$; remember, all we are carrying forward is the hidden layer, okay. This is sort of like what Dr Ganapathy talked about in transfer learning.

I have already learned something, so why waste that? I will transfer that information here. So this input goes here, $h_1$ goes here, and $h_2$ and $\hat{y}_2$ are predicted. This is like recursion: not only do I want to know what is happening now, I want to know what happened a little bit before. Then I take $x_3$ and predict $\hat{y}_3$, and the prediction goes on like this. We will see later how to find the losses and do back propagation for such an architecture, but you can go till some final time.

The point of an RNN is that $T$ is variable. When we say variable sequence, that is what we mean. So $x_1$ is $3 \times 1$, $x_2$ is $3 \times 1$. What does $x_1$ correspond to? $x_1$ corresponds to the 3 accelerometer readings at time 1, $x_2$ corresponds to the 3 accelerometer readings at time 2, and so on and so forth, up until time $T$, where you collect the 3 accelerometer outputs, okay. What we had shown in the plot was one accelerometer direction; similarly you will have 3 such values at each time instant. That is what you give here, okay.

Now what are $\hat{y}_1$, $\hat{y}_2$, $\hat{y}_3$? You know this from before: for a classification task, this is simply the softmax output, okay. Another way to say it is, it is our approximation of a one hot vector; we know that this has to be either 01000 or 10000, etc., and we will approximate it through a softmax output. Now MATLAB in this particular example has chosen slightly differently; I will show you that. Before I do, a little bit about terminology. Oftentimes in later videos, I will say number of layers. That is, strictly speaking, not correct.

Number of layers should count depth, not this direction, which is length, time or sequence, okay. But I will sort of abuse notation and say, you know, when the number of layers increases it is going to be very difficult to train; you should understand that the number of layers is actually in the depth direction. Right now, I have shown you a case with effectively one hidden layer. But just like we had neural networks before this which had more than one layer, you can simply do the same thing. You can take $x$, put in, let us say, 2 layers and then get out $y$. In fact, MATLAB has done something similar.

(Refer Slide Time: 25:47)

If you see their description, after the hidden layer there is a fully connected layer, which you know the meaning of. So let us say I start with $x_1$ and have a hidden layer with 100 neurons. After that I can put a fully connected layer of 5 neurons, which means you are going to get a fully connected mapping here. After this, you have a softmax over the 5 outputs, okay. So you make a softmax layer after that, using our usual softmax prediction formula, and then go on from there.

But similarly, you can put more layers; typically people do not go beyond 8 layers, and those are for very complex tasks. I will show you a small example in a later video of a language task, okay. But for engineering tasks, usually 1 or at most 2 layers are good, okay. So especially if you are dealing directly with numbers, 1 or 2 layers is usually good enough. And you will see that it actually takes a lot of effort to train such things, okay.

(Refer Slide Time: 27:08)

Now what is special about this RNN is that the weights that go from time step to time step are exactly the same, okay. Recollect, if you have $x_t$ and $h_{t-1}$ coming in, $h_t$ coming out and, let us say, $\hat{y}_t$, then $h_t = \tanh(w_{xh} x_t + w_{hh} h_{t-1} + b)$. The subscripts are as follows: the matrix multiplying $x_t$ I will call $w_{xh}$, and the matrix multiplying $h_{t-1}$ I will call $w_{hh}$.

Now for this example, let us see what the sizes of these matrices ought to be, okay. $x_t$ at any point is a $3 \times 1$ matrix, and $h$ is a $100 \times 1$ matrix. So what should be the size of $w_{xh}$? Please think about it. Similarly, $h_{t-1}$ is a $100 \times 1$ matrix and $h_t$ is also a $100 \times 1$ matrix; what should be the size of $w_{hh}$? Please think about it. $h_t$ is of size $100 \times 1$; what should $b$ be? Okay.

If you are thinking about it, please pause the video for a second, work it out and then come back; I will show you the answer after you are done, okay. I hope you tried it. So, as you can see, $w_{xh}$ takes in a $3 \times 1$ matrix and should give you a $100 \times 1$ matrix as output, which means $w_{xh}$ has to be $100 \times 3$. Similarly, $w_{hh}$ has to be $100 \times 100$ so that it can take a $100 \times 1$ matrix and give a $100 \times 1$ matrix as output. Simple dimensional consistency requires that you have 100 biases for the $100 \times 1$ result here, okay.

Now, given all this, how many parameters do we have? You have 300 parameters here, 10,000 parameters here and another 100 parameters here, okay. So you have $300 + 10{,}000 + 100 = 10{,}400$ parameters for this problem. Now you may ask: you have 10,400 parameters, but then what happens at the next time step? The catch here, and this is why I said that in engineering examples it is a good idea to take the same $\Delta t$ between successive samples, is that we will assume it is the same $w_{xh}$, $w_{hh}$ and $b$ for each time step, okay.

So the only $w$'s you use are the $w$'s of the first time step; they remain constant as you go across the RNN, and that is the power of an RNN. Otherwise, the number of parameters would explode. So from time step to time step, you are going to assume that the matrices do not change; that is, the relationship between $h_3$ and $h_2$ is the same as the relationship between $h_4$ and $h_3$, which is the same as the relationship between $h_5$ and $h_4$. This basically is what gives power to the RNN.
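Here is a minimal NumPy sketch of the recursion and the sizes just described (3 inputs, 100 hidden units, 5 softmax outputs); the random weights and the sequence length of 50 are placeholders for illustration, not MATLAB's trained values.

import numpy as np

rng = np.random.default_rng(1)
n_in, n_h, n_out, T = 3, 100, 5, 50             # 50 time steps, just for illustration

Wxh = rng.standard_normal((n_h, n_in)) * 0.01   # 100 x 3   -> 300 parameters
Whh = rng.standard_normal((n_h, n_h)) * 0.01    # 100 x 100 -> 10,000 parameters
b   = np.zeros((n_h, 1))                        # 100 x 1   -> 100 parameters
Why = rng.standard_normal((n_out, n_h)) * 0.01  # output (fully connected) layer

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

x = rng.standard_normal((n_in, T))   # one sequence: ax, ay, az at each instant
h = np.zeros((n_h, 1))               # h_0
y_hat = []
for t in range(T):
    # the same Wxh, Whh, b are reused at every time step -- the weight sharing of an RNN
    h = np.tanh(Wxh @ x[:, t:t+1] + Whh @ h + b)
    y_hat.append(softmax(Why @ h))   # 5 x 1 class probabilities at time t

# 300 + 10,000 + 100 = 10,400 recurrent parameters (the output layer adds its own on top)
print(Wxh.size + Whh.size + b.size)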

More than power, this is what gives compactness, okay. So once you use this, you have your RNN; you have just set up your structure this way. There are various ways of compactly representing RNNs; we are not going to deal with that in this course, this is just to give you an introduction. I would recommend very strongly that, even before you see the final video, you run this particular case and see if you are able to understand what is going on. You can in fact see here that there is an LSTM layer, which is followed by a fully connected layer and a softmax layer, and finally the classification layer, which will tell you how it goes.

(Refer Slide Time: 32:39)

Hopefully you also saw the number of parameters that pops up in this problem, in this case 10,400 parameters, and you have to train all these 10,400 parameters via back prop. Now how do you do back prop for this case? How do you calculate loss in such a case? This we will see in the next few videos. Thank you.

Machine Learning for Engineering and Science Applications
Dr. Balaji Srinivasan
Department of Engineering Design
Indian Institute of Technology, Madras
Training RNNs

Welcome back. In this video we will be seeing how RNNs are trained. You will see that there are several commonalities between RNNs and, let us say, CNNs and even ANNs, as far as training them is concerned. Of course, as usual you should have a training set, a validation set and a testing set. But that apart, given the particular structure of RNNs, there are certain things that you need to be aware of in terms of training. So we will just go through those in this video. There are several even deeper ideas that could be conveyed, which we will not do, okay.

(Refer Slide Time: 1:19)

So in terms of implementation, luckily all the training has been abstracted into the various packages, whether it is MATLAB, TensorFlow, PyTorch, etc. But there are a few ideas that will help you later on when you try to train RNNs yourself. So the 2 issues that we will be concerned with in this video are: first, calculating loss in an RNN. There is a mild difference between what happens in an RNN and what happens in a CNN or, let us say, an ANN. The second issue is what is called back propagation through time, sometimes simply called BPTT.

(Refer Slide Time: 2:31)

So we will be looking at this from the overall view; I will not be going too deep into the mathematics, but a little bit of mathematics we will do. And hopefully this will give you some insight into what actually goes on inside the code when it trains RNNs. So, first let us look at the loss function for an RNN. Let us see a simple structure. As usual, you have some $h_0$ going in, and let us say we have unrolled an RNN through many, many layers. So let us say the total number of layers is equal to capital $T$.

Why capital $T$? Because we are thinking of the RNN as something that goes through time; so this is $t = 1$, the 1st instant, $t = 2$ the 2nd instant, $t = 3$ the 3rd instant, and so on and so forth, and let us say we are going up to $h_T$, okay. Now a question with an RNN usually is: where are we going to take out the outputs? And as we saw in the introductory videos, you have several choices; it depends on really what you want. In some cases you will be taking out an output only at the end, but there are several cases where you might be interested in, let us say, finding outputs at all intermediate layers also.

Just for consistency, I will call the outputs $\hat{y}$, because that is what we have been calling our model or predicted values so far, okay. Now, when do you have multiple predicted values? Let us take an example: suppose the input $x_0$ is the temperature 10 days ago in some city, let us say Chennai, okay. Given that input, you would have the next day's temperature, let us say that is $\hat{y}_1$, the next day's temperature $\hat{y}_2$, the next day's $\hat{y}_3$, till, let us say, today's temperature, which is $\hat{y}_T$.

Now for each one of them, you also have a corresponding ground truth, which would be $y_1, y_2, y_3, \ldots, y_T$. Okay, so this is the ground truth. And whenever you have a ground truth and a prediction and the 2 differ, you will have a loss function, okay. So you do not have a loss only right at the end, like we do with, let us say, ANNs or CNNs, or at least with their usual architectures; you can possibly have one at every step. This is not necessary; as I discussed earlier, it could be optional. But suppose you do take out an output: you do have a corresponding loss function.

So the total loss is actually the summation of all the intermediate losses through the layers, okay. Now in terms of $L_t$ itself, the local loss function, you again have many choices, but having seen only 2, you can either use cross entropy or least squares, depending on what sort of problem it is. Typically in this course we have used least squares whenever it was a regression or a numerical output, for example, let us say, temperature today. And we have been using cross entropy in case it was a classification problem.
For example, you could ask: will it rain or not? In such a case, you would probably use something like cross entropy as the loss function. In either case, we will simply say that
\[
L = \sum_{t=1}^{T} L_t
\]
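A small sketch of this, assuming a regression-style output with a least-squares local loss (for classification, each $L_t$ would instead be a cross-entropy term); the temperature values here are made up for illustration.

import numpy as np

# Total sequence loss: one local loss per emitted output, summed over time.
def sequence_loss(y, y_hat):
    # y, y_hat: (T,) arrays of targets and predictions, one per time step
    per_step = 0.5 * (y - y_hat) ** 2   # L_t for a numerical output
    return per_step.sum()               # L = L_1 + L_2 + ... + L_T

y     = np.array([21.0, 22.5, 23.1])    # hypothetical daily temperatures
y_hat = np.array([20.5, 22.0, 24.0])
print(sequence_loss(y, y_hat))          # 0.5 * (0.25 + 0.25 + 0.81) = 0.655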

So this is the issue of calculating the loss function: it is a simple extension of all the previous loss functions we have seen so far, summed over time. Now the more important idea, or a subtler idea really, is that of back propagation.
(Refer Slide Time: 6:48)

So let us look at back prop through time; we will call it BPTT for short. Let us see what the subtle issues involved here are, okay. As usual, we start with an $h_0$; there is $h_1$ with input $x_1$, then $h_2$ with $x_2$, $h_3$ with $x_3$, and so on and so forth up until $h_T$ with $x_T$. And for now we will assume that we are taking out an output at every single time instant, okay. Now let us write the expression for any of these $h$: we have $h_t = g(w_{hh} h_{t-1} + w_{xh} x_t)$, where $g$ is usually $\tanh$, and we also had $\hat{y}_t = g^*(w_{yh} h_t)$.

(Refer Slide Time: 9:30)

$g$ need not be the same as $g^*$; that is, we can use a different nonlinearity in the recurrence and in the output, and in fact in several cases we simply use a linear prediction for the output, okay. So we call this output term $w_{yh} h_t$. So these are the 3 sets of matrices. Now for this video and for the few that follow, just for simplicity, we will use a different notation: we will call the matrix $w_{hh}$ simply $W$, the matrix $w_{xh}$ simply $U$, and the matrix $w_{yh}$ simply $V$, so that I am going to write $h_t = g(W h_{t-1} + U x_t)$ and $\hat{y}_t = g^*(V h_t)$, okay.

Now the most important thing about an RNN is that $W$, $U$ and $V$ do not change with time, that is, across layers. This is what actually makes it possible to train RNNs, at least within a reasonable amount of time. So just to clarify: $h_0$ multiplied by $W$, plus $x_1$ multiplied by $U$, with a nonlinearity, gave me $h_1$; and $h_1$ multiplied by $V$, again with a nonlinearity of course, gave me $\hat{y}_1$. Now the $W$, $U$ and $V$ at every time step are exactly the same, okay. And you will see how that plays out when we do back propagation.

So unlike an ANN, where at each layer these weight matrices actually change, in an RNN they are exactly the same: we use the same $W$, $U$, $V$ for each layer. How does this help us? Of course, we now have fewer parameters to train. But while doing back propagation we have to be a little bit careful.

(Refer Slide Time: 11:23)

So let us take a specific case. Remember, for back prop we need to find the derivatives of the loss function; I am calling it $L$ here, you can use $J$ or $L$, depending on what you are comfortable with, for now I am using $L$, with respect to all the matrices. This means we need to find out $\frac{\partial L}{\partial W}$, $\frac{\partial L}{\partial V}$ and $\frac{\partial L}{\partial U}$, because we have 3 matrices as far as this RNN is concerned. So this sort of RNN that I showed you has 3 matrices, $U$, $V$ and $W$, and I want to find out $\nabla L$ with respect to each of these weights, okay.

Now, remember that $L$ itself is a summation of the $L_t$; that is, we find the loss at each of these layers, let us say $L_1, L_2, \ldots$ as I did a little bit before, and I need to find the 3 derivatives of $L_1$, the 3 derivatives of $L_2$, the 3 derivatives of $L_3$, and then sum these up; that will give me $\frac{\partial L}{\partial W}$, $\frac{\partial L}{\partial V}$ and $\frac{\partial L}{\partial U}$. So let us consider just one of these terms. Let us consider $\frac{\partial L_3}{\partial W}$, just to show you what happens, okay. So we will consider just one of these local losses and see how we can apply back propagation to it.

(Refer Slide Time: 13:27)

Okay, I will draw a small figure here just to repeat what we had before, for clarity: $h_0$ and $x_1$ feed $h_1$, giving $\hat{y}_1$; then $h_2$, $x_2$, $\hat{y}_2$; then $h_3$, $x_3$, $\hat{y}_3$; and $\hat{y}_3$ leads to $L_3$, because $\hat{y}_3$ in general will be different from $y_3$. The matrices involved are $U$ on the inputs, $W$ between hidden states and $V$ on the outputs, okay. Now, before we proceed, I am going to make some assumptions on the structure, okay. Remember $\hat{y}_3 = g^*(V h_3)$; I am going to assume that the output nonlinearity is simply the linear function, that is, a linear activation function.

That is just to make some of our derivation a little bit simpler; you can do it for any case. So I will assume that $\hat{y}_3 = V h_3$ and in general $\hat{y}_t = V h_t$; you know that is not going to matter as far as this video is concerned. Okay, the next thing is we will assume that the loss function is a least squares function. So, to make it simpler, let us just deal with $L_3 = \frac{1}{2}(y_3 - \hat{y}_3)^2$. That is just to make the differentiation a little bit easier. Once again, you can do this kind of derivation for any loss function.

Now, given these 2 assumptions, what do we need to find out? Let me do it here while the figure exists. We want to find $\frac{\partial L_3}{\partial V}$; let us start with that. How would I do it? Like we did before,
\[
\frac{\partial L_3}{\partial V} = \frac{\partial L_3}{\partial \hat{y}_3}\frac{\partial \hat{y}_3}{\partial V}.
\]
Now, $\frac{\partial L_3}{\partial \hat{y}_3} = -(y_3 - \hat{y}_3)$, okay. So that is fairly straightforward; you can just simply differentiate the loss, as we did before even while doing ANNs, okay. Now what about $\frac{\partial \hat{y}_3}{\partial V}$?

$\frac{\partial \hat{y}_3}{\partial V} = h_3$, but there is a small catch here, which I will mention now and leave as an exercise; we will be asking this within this week's exercise. Work it out based on the matrix sizes. Remember $L_3$ is a scalar and $V$ is a matrix, so the whole derivative $\frac{\partial L_3}{\partial V}$ is actually a matrix, okay. $(y_3 - \hat{y}_3)$ is a vector and $h_3$ is also a vector, so please think about which sort of product should come here, or how you should arrange this, so that you get a matrix out of this product. Please think about this; we will be giving it as one of the exercise questions.

Regardless, what I will write here is simply that
\[
\frac{\partial L_3}{\partial V} = -(y_3 - \hat{y}_3)\, h_3.
\]
This simply gives you how much the loss will change if I were to change $V$. That is fairly straightforward. Now, there is a subtler, or harder, question here, which is: what is $\frac{\partial L_3}{\partial W}$? Why is this a harder question? Let us first start doing a similar exercise to the one we have just done. Suppose I want $\frac{\partial L_3}{\partial W}$; mathematically,
\[
\frac{\partial L_3}{\partial W} = \frac{\partial L_3}{\partial \hat{y}_3}\frac{\partial \hat{y}_3}{\partial h_3}\frac{\partial h_3}{\partial W}.
\]

(Refer Slide Time: 19:22)

I hope this is clear. We just saw that $\frac{\partial L_3}{\partial \hat{y}_3} = -(y_3 - \hat{y}_3)$; that should be straightforward. Similarly, what is $\frac{\partial \hat{y}_3}{\partial h_3}$? Please notice, this is simply $V$. Now what about $\frac{\partial h_3}{\partial W}$? For that we need to know what the expression for $h_3$ is. Recall that $h_3$ was $g$ (some nonlinearity, usually $\tanh$) of a linear combination: $h_3 = g(W h_2 + U x_3)$, okay.

We want $\frac{\partial h_3}{\partial W}$. For simplification, let us call the argument $z_3$, so that $h_3 = g(z_3)$, very similar to what we had for ANNs. If you have that, then
\[
\frac{\partial h_3}{\partial W} = \frac{\partial g}{\partial z_3}\frac{\partial z_3}{\partial W} = g'(z_3)\frac{\partial z_3}{\partial W}.
\]
What is $\frac{\partial z_3}{\partial W}$? Differentiating $W h_2 + U x_3$ should give us $h_2$, which is straightforward, plus there is one more term, which is $W\frac{\partial h_2}{\partial W}$. Now, why does this term exist; is it 0 or nonzero? It is nonzero because notice that $h_2$ itself depends on $W$, okay. Just like $h_3$ depends on $W$, $h_2$ also depends on $W$, because it is the same $W$ throughout. This is the catch with back propagation through time. Unlike an ANN, where you would have a $W_1$ here and a $W_2$ there, in the RNN it is the same $W$ everywhere.

So you cannot find this out independent of $\frac{\partial h_2}{\partial W}$, okay. In summary, you have
\[
\frac{\partial h_3}{\partial W} = g'(z_3)\left(h_2 + W\frac{\partial h_2}{\partial W}\right).
\]
Now suppose you want $\frac{\partial h_2}{\partial W}$: you will have to go back again, and this is why it is called back propagation through time. You get $\frac{\partial h_2}{\partial W} = g'(z_2)\left(h_1 + W\frac{\partial h_1}{\partial W}\right)$, etc., okay. So whenever you want to find $\frac{\partial L_3}{\partial W}$, that is, $\nabla L_3$ with respect to $W$, you first come down the chain rule, okay.

You have those 2 straightforward terms sitting there, but the third is actually a more complex term: it is $g'(z_3)\left(h_2 + W\frac{\partial h_2}{\partial W}\right)$, and in order to calculate $\frac{\partial h_2}{\partial W}$ you will have to go back, okay, and so on and so forth. Similarly, whenever you want to find, let us say, $\frac{\partial L_T}{\partial W}$, it will involve all the gradients before it. So you will have these repeated, sort of recursive, additions sitting there, and there are very clever ways of writing these codes, as people have already done within TensorFlow, etc.
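To make the recursion concrete, here is a sketch in the scalar case, where $h_t = \tanh(w h_{t-1} + u x_t)$ and the recursion reads $\frac{dh_t}{dw} = (1 - h_t^2)\left(h_{t-1} + w\frac{dh_{t-1}}{dw}\right)$; the analytic recursion is checked against a finite difference. The numbers are arbitrary and this is only an illustration of the idea, not how packages implement it.

import numpy as np

# Scalar BPTT recursion: the gradient at step t needs the gradient at step t-1.
def forward_and_dhdw(w, u, x, h0=0.0):
    h, dhdw = h0, 0.0
    for xt in x:
        z = w * h + u * xt
        dhdw = (1 - np.tanh(z) ** 2) * (h + w * dhdw)   # uses the *previous* h and dh/dw
        h = np.tanh(z)
    return h, dhdw

w, u = 0.8, 0.5
x = np.array([0.1, -0.3, 0.7])
h3, analytic = forward_and_dhdw(w, u, x)

eps = 1e-6   # finite-difference check of dh_3/dw
numeric = (forward_and_dhdw(w + eps, u, x)[0] - forward_and_dhdw(w - eps, u, x)[0]) / (2 * eps)
print(analytic, numeric)   # the two should agree to several decimal places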

(Refer Slide Time: 24:49)

So we have now seen 2 terms: we have seen $\frac{\partial L_3}{\partial W}$, and we have also seen $\frac{\partial L_3}{\partial V}$, which was straightforward. Finally, let us look at $\frac{\partial L_3}{\partial U}$. Once again, the same thing; the first 2 factors will be the same:
\[
\frac{\partial L_3}{\partial U} = \frac{\partial L_3}{\partial \hat{y}_3}\frac{\partial \hat{y}_3}{\partial h_3}\frac{\partial h_3}{\partial U},
\]
and the middle factor is $V$. Now, this is, if anything, mildly trickier than the previous one. If you notice the figure: when I was doing $\frac{\partial L_3}{\partial W}$, you could see that $L_3$ depends on $\hat{y}_3$, which depends on $h_3$, which depends on $W$, okay; and through the $W$ the dependency comes through.

Now it might look like, when we take a gradient with respect to $U$, everything should go through straightforwardly, but there is some subtlety there too. So let us calculate this term. Suppose I want $\frac{\partial h_3}{\partial U}$; remember this is $\frac{\partial}{\partial U}(h_3)$ with $h_3 = g(z_3)$, so it is $\frac{\partial g(z_3)}{\partial z_3}\frac{\partial z_3}{\partial U}$, okay. Therefore,
\[
\frac{\partial h_3}{\partial U} = g'(z_3)\left(x_3 + U\frac{\partial x_3}{\partial U} + \frac{\partial}{\partial U}(W h_2)\right).
\]
Now, the middle term is of course 0, because $x_3$ does not depend on $U$ at all. What about the last term? This term is, like I said, a little bit subtle. If I look at $\frac{\partial (W h_2)}{\partial U}$, this can be written as $W\frac{\partial h_2}{\partial U} + 0$. But $\frac{\partial h_2}{\partial U}$ is not 0, because it equals $g'(z_2)\frac{\partial z_2}{\partial U}$, and $\frac{\partial z_2}{\partial U}$ is not 0. Why is that? Because $z_2 = W h_1 + U x_2$, which depends on $U$.

So this is very similar to what we did with $W$: there is a recursion there, okay. So in both cases, for $W$ as well as $U$, you have a dependency sitting there which will make you back propagate through time. You cannot simply find the gradient of $L_3$ with respect to $W$ or $U$ without finding the gradients of the hidden states throughout time. And this is why sophisticated expressions exist; the three gradients are collected below. In the next couple of videos, when we look at deep RNNs, you will see that this issue can actually become a little bit more complicated.
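To collect the three gradients in one place (this is just a consolidation of the expressions derived above, written loosely as in the lecture, with the exact matrix arrangement and transposes left to the exercise, and the recursive terms expanded only one level):
\[
\frac{\partial L_3}{\partial V} = -(y_3 - \hat{y}_3)\,h_3, \qquad
\frac{\partial L_3}{\partial W} = -(y_3 - \hat{y}_3)\,V\,g'(z_3)\left(h_2 + W\frac{\partial h_2}{\partial W}\right),
\]
\[
\frac{\partial L_3}{\partial U} = -(y_3 - \hat{y}_3)\,V\,g'(z_3)\left(x_3 + W\frac{\partial h_2}{\partial U}\right).
\]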

This is why TensorFlow, for example, builds a full computational graph: all these dependencies are resolved in terms of graphs by using automatic differentiation, and back propagation uses automatic differentiation of one sort or the other. So that is it for back propagation through time. The basic idea was for you to see that training can be a little bit more complex. In practice, unless you are writing entirely new architectures by yourself, something that nobody has ever thought of, you will not really be doing this by hand ever. But it is important for you to at least get an intuition of what is happening, as well as to see the next video, where we will be dealing with vanishing or exploding gradients. Thank you.

Machine Learning for Engineering and Science Applications
Dr. Balaji Srinivasan
Department of Engineering Design
Indian Institute of Technology, Madras
Vanishing Gradients and TBPTT

(Refer Slide Time: 0:32)

Welcome back. In the last video you saw back propagation through time, which we called BPTT; we also found out how to write the loss function. In this video we will see what issues come out of defining the loss function this way and of trying to do back propagation through time with a constant $W$. Okay, remember that the basic reason we had to do back propagation through time was that $U$, $V$, $W$, our matrices, were constant across time, because of which you had sort of recursive expressions for the gradient of the loss with respect to $W$ and with respect to $U$.

Okay, in this video I will tell you the 2 issues which come up. The 2 main issues are that gradient calculations either explode or vanish. Both of these can happen, and neither is ideal. The other thing is that the gradient calculations are expensive. You will see that each of these issues has a different solution. The solution for exploding gradients is something called gradient clipping. The solution for vanishing gradients is alternative architectures; specifically, in the next few videos, not this one, we will see 2 of the alternate architectures, called GRU and LSTM.

As I had said earlier, you should see them as alternative architectures for RNNs, just like you had specific architectures in CNNs, such as LeNet, AlexNet, etc. And finally, for expensive gradient calculations, or expensive back propagation, we have something called truncated back propagation through time, where the T stands for truncated. So these are the 3 issues that we will be looking at within this video, or at least we will see the origins of these, and I will give you a brief introduction to truncated back propagation through time. All of this is in the context of you getting a flavour of how RNNs differ from ANNs and CNNs.

There are, of course, like everything else in this course, a lot of details within this which we will not be able to cover. This is basically an overview course, and we will try to give you an overview, so that you can understand some issues when they come up as you try to train RNNs.

(Refer Slide Time: 3:14)

So let us look at a typical RNN: let us say this is the time axis; as before, this is the input, this is the hidden layer and this is the output. Let us say this is $x_1$, $h_1$, $\hat{y}_1$, then this is $L_1$; this is $\hat{y}_2$, this is $L_2$, so on and so forth; if this is $\hat{y}_T$, this is $L_T$, and we sum all these up, as we saw in the previous video, to get the total loss, i.e., $L = \sum_{t=1}^{T} L_t$, okay. For gradient calculations, remember, we do something like $W = W - \alpha \frac{\partial L}{\partial W}$ if we are doing simple gradient descent, and this term has to be calculated from $\sum_{t=1}^{T} L_t$. Further, we saw that you cannot simply calculate, let us say, $\frac{\partial L_3}{\partial W}$ in the usual way: it involves something like $\frac{\partial h_3}{\partial W}$, and we saw that $\frac{\partial h_3}{\partial W}$, in turn, involves $\frac{\partial h_2}{\partial W}$, which involves $\frac{\partial h_1}{\partial W}$. We saw that this is true for $\frac{\partial L}{\partial U}$ also. This is basically what we call back propagation through time, because none of these terms is independent. Now this kind of dependency creates several problems. So let us look at a heuristic description.

Okay, I want to point out that this is not an exact, rigorous piece of mathematics; it is heuristic. Heuristic means rough, a sort of handwaving argument; it does go through mathematically also, but that is well beyond the scope of this course. We will just give you an intuition for why this kind of thing causes a problem, okay. Now note that for $h_t$ our general expression was $h_t = \tanh(W h_{t-1} + U x_t)$.

Now I am going to say a few things, so please follow closely. Assume that you can somehow ignore $x_t$; you cannot, really, but let us say for now that you can. Then $h_t$ goes approximately as $\tanh(W h_{t-1})$. Let me further ignore the $\tanh$; then I will say $h_t$ goes as $W h_{t-1}$. What does "goes as" mean? Order of magnitude, okay. So you are trying to estimate how much $h_t$ is affected by $h_{t-1}$, and you can see that $h_t$ scales as $W h_{t-1}$.

Long, long ago, in week 1, we saw the ideas of eigenvalues, eigenvectors, etc. I am going to make a further approximation very shortly, so please do recall that. If $h_t$ is $W h_{t-1}$, then $h_{t+1}$ goes as $W h_t$, which goes as $W^2 h_{t-1}$. So, in general, $h_{t+n}$ will go as $W^{n+1} h_{t-1}$; just to be clear, I will write this as $h_{t+n} \sim W^n h_t$. So the weight matrix keeps on constantly multiplying as you go through time. Thus $h_3$ would be, so to speak, $W^2 h_1$, and if I have something like $h_5$, that will be $W^4 h_1$, so on and so forth.

(Refer Slide Time: 8:02)

Now, again, all these are heuristic arguments, but they turn out to be remarkably good approximations; unfortunately I cannot show you this within the scope of this course. If I take the norm of $h_{t+n}$ (notice $h_t$ is a vector, so I take its norm; which norm, let us say the 2-norm, does not really matter), it will be some factor times the norm of $h_t$. Remember a norm is a scalar, so this is a number: the size of $h_{t+n}$ will be some number times $\lVert h_t \rVert$.

And it turns out that the factor scales approximately as an eigenvalue of $W$ raised to the power $n$. Another way to see this is to assume that $W$ is diagonal: then $W^n$ simply has all its diagonal terms, which are its eigenvalues, raised to the power $n$. Now, which eigenvalue? We will see shortly; it will either be the largest or the smallest. The worst-case scenario is that it is the largest; the best-case, or smallest-growth, scenario is that it is the smallest.

Now, we have made about 6 or 7 approximations here, but all this argument is supposed to show is that there is a scaling going on. As long as I use the same $W$ throughout time, which I do for RNNs, these vectors constantly get larger in magnitude or constantly get smaller in magnitude. And like the sensor example we took in an earlier video, you can have a large number of time steps; I called them layers, but you can see it as a time sequence.

Okay, so if you have a large number of time steps, this number, even if it is close to 1, for example even if it is 1.01, over time is going to get to be huge. This is the power of the exponential, or power, function. So if $|\lambda| > 1$, then as $n$ increases, $\lVert h_{t+n} \rVert$ tends to $\infty$; tends to $\infty$ means, more appropriately, becomes very large, okay. All this, remember, is simply heuristic. Similarly, if $|\lambda| < 1$, then as $n$ increases, $\lVert h_{t+n} \rVert$ actually tends to 0.
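A tiny numerical sketch of this heuristic, using a diagonal $W$ so that the eigenvalue is explicit; the values 1.01 and 0.99 are just examples.

import numpy as np

# Repeatedly applying the same W scales ||h|| roughly like lambda^n:
# it blows up for |lambda| > 1 and dies out for |lambda| < 1.
for lam in (1.01, 0.99):
    W = np.diag([lam, lam])          # diagonal W: its eigenvalues are just lam
    h = np.ones(2)
    for n in (100, 500, 1000):
        hn = np.linalg.matrix_power(W, n) @ h
        print(f"lambda={lam}, n={n}: ||h_n|| = {np.linalg.norm(hn):.3e}")
# lambda = 1.01 grows without bound; lambda = 0.99 shrinks toward zero.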

Now this is simply for $h$; you can show, and I would request you to try this out by looking at the expressions in BPTT, that similar arguments hold true for $\frac{\partial L}{\partial W}$ also. That is to say, if you remember, if I look at $\frac{\partial L_3}{\partial W}$, the expression in the last video, it depended on $\frac{\partial h_3}{\partial W}$. And if you look up the expression, $\frac{\partial h_3}{\partial W}$ will look something like $W\frac{\partial h_2}{\partial W}$; there will be other terms there, you will have $g'$, etc., but this will be the primary term sitting there.

So the gradient depends backwards in time in very much the same way as $h$ depends forwards in time. If I look at the gradient at one step, it will be $W$ times the gradient at the previous step, so on and so forth. So if you back propagate in time, you are going to get a similar exponential effect there too. This case of the gradient growing, that is, $\frac{\partial L}{\partial W}$ increasing or tending to $\infty$, is called an exploding gradient. And why is this gradient exploding? Simply because the same operation repeated continuously, going back and back in time, can actually lead to large numbers.

Similarly, $\frac{\partial L}{\partial W}$ tending to 0 is called a vanishing gradient. Now why are these problems? They are problems because, obviously, you are never going to get exactly $\infty$, since you are still dealing with finite numbers. But recall our discussion in week 3: we have finite precision. If you have finite precision and you have numbers that are growing ever larger, then, if you actually did the calculation by hand, you would have a large gradient and you would move to an entirely different point in parameter space, if you could actually calculate it.

But the problem is, the moment it goes above the largest number that your machine can represent, it will actually show you NaN, not a number, or it will show you $\infty$, so on and so forth. So, really speaking, finite precision machines cannot handle exploding gradients. Similarly, you will never actually go to exactly 0. If you compute something like $0.99^{1000}$, it will be a very, very small number; the problem is it might actually become smaller than what your machine can represent accurately, around $10^{-16}$.

So at that point you will no longer train; that is what we have called saturation several times before. You will get a very small gradient, and training is practically gone. There is another problem, which sits in the nonlinearity and which I will not discuss in detail: the $\tanh$ itself is being repeated multiple times, okay. You have $h_t = \tanh(W h_{t-1})$, so $h_{t+1}$ involves $\tanh$ applied to a $\tanh$, that is, $\tanh$ composed twice, then composed three times, and so on.

(Refer Slide Time: 14:58)

Now, remember what $\tanh$ itself looks like: $\tanh$ composed with itself looks flatter, $\tanh$ composed three times looks flatter still, and if you compose $\tanh$ a hundred times, it flattens even more. Notice that in all these cases the gradients become flatter and flatter and hence very small. Now, all these problems put together lead to these 2 issues. The repeated $\tanh$ problem leads only to the vanishing gradient issue, but a large number of layers can lead either to exploding gradients or to vanishing gradients, and both of these make training very difficult.
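A small sketch of the repeated-tanh effect: the $n$-fold composition of tanh and its slope at a fixed point, computed by the chain rule. The evaluation point 1.0 is arbitrary.

import numpy as np

# Composing tanh with itself flattens it, so its slope (and hence the factor
# that back propagation multiplies in at each step) keeps shrinking.
x = 1.0
for n in (1, 2, 3, 100):
    y = x
    for _ in range(n):
        y = np.tanh(y)
    # derivative of the n-fold composition at x, by the chain rule
    slope, z = 1.0, x
    for _ in range(n):
        slope *= 1 - np.tanh(z) ** 2
        z = np.tanh(z)
    print(f"n={n}: tanh^n(1.0) = {y:.4f}, slope = {slope:.3e}")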

Why does it make training very difficult? Because we train through gradients; that is really how we train. So either an exploding gradient or a vanishing gradient is a problem, okay. Now how do we handle this? As I have said before, for exploding gradients it turns out that a very simple hack works really well. This trick is called gradient clipping. It is sort of a numerical hack, and it is very simple: we decide on a maximum allowable gradient size.

What do I mean by the value of a gradient? Again, remember, the gradient is a vector, so you cannot give it a single value; you can, however, give a value to the norm of the gradient. So let us say we are dealing with $g = \frac{\partial L}{\partial W}$. I will say that the maximum allowable value of $\lVert g \rVert$ is some $G_{max}$. You will decide it, okay, based on what you are comfortable with. Just like our cut-off criteria, this is an arbitrary criterion set by you; it is sort of an engineering solution to the problem, okay.

(Refer Slide Time: 16:40)

So how does your algorithm look? Say you are doing gradient descent. While doing gradient descent you calculate $g$; remember $g$ is $\frac{\partial L}{\partial W}$, and it could likewise be $\frac{\partial L}{\partial U}$, similarly for all the variables. Now you calculate $\lVert g \rVert$. If $\lVert g \rVert < G_{max}$, you proceed as usual. If not, then what do we do? We hold onto the gradient direction and define a new gradient
\[
g^* = \frac{g}{\lVert g \rVert}\, G_{max},
\]
that is, the unit vector in the gradient direction multiplied by $G_{max}$.

So what this says is: my new gradient is in the same direction as the gradient you calculated, but I am cutting down its size. If the computed gradient was this big and my maximum length allowed was smaller, I keep the shortened vector, which for clarity I call $g^*$, in place of the original $g$. So this is gradient clipping: you proceed with this new gradient every time the computed one exceeds the threshold. This might seem like a simple hack, but it actually tends to work well. So this is what we do for exploding gradients.
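A sketch of this rule in NumPy; the threshold $G_{max}$ here is an arbitrary choice, as discussed.

import numpy as np

# Gradient clipping as described: keep the direction, cap the norm.
def clip_gradient(g, g_max):
    norm = np.linalg.norm(g)
    if norm < g_max:
        return g                   # small enough: proceed as usual
    return (g / norm) * g_max      # same direction, norm cut down to G_max

g = np.array([30.0, 40.0])         # ||g|| = 50
print(clip_gradient(g, 10.0))      # -> [6. 8.]: norm 10, direction preserved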

(Refer Slide Time: 18:19)

Now what do we do for vanishing gradients? Unfortunately, no such simple solution exists. For a vanishing gradient, we use different architectures. I will talk about the basic idea of these architectures and why they work in the next video, but 2 such architectures are what are called the gated recurrent unit and long short-term memory. You might remember long short-term memory as what we actually used in the example problem, in the earlier video on sensors, etc.

Okay, so that is what we use, and we will see how to do that in the next video. Now, finally, there was a 3rd problem, which was the computation of the gradient: it is expensive. The solution I am going to give right now also handles, to a certain extent, both the vanishing and the exploding gradient problems, so it is sort of a compound solution. If we go back to this figure, let us say that it actually represents thousands and thousands of time steps, okay, and you want to calculate back propagation through time. Now how would you do it?

The way you would usually do it is: you forward propagate through the whole thing, calculate the whole of $L$, and then back propagate through the whole thing. Now remember the example that we had before; that example had about 65,000 time steps. Would you really go forward through the full thing and come back through the full thing? By that time, almost any correction you give will lead to vanishing or exploding gradient problems, plus it would become potentially very expensive just to do one gradient update.

Now, similar to the relationship between gradient descent and stochastic gradient descent, you can do something similar with back propagation through time, and that is called truncated back propagation through time. It is a very simple solution. Let us say I draw boxes here, where each box represents an input, a hidden layer and an output, and of course a loss, okay, so on and so forth, up to the end.

Now, given the sort of time invariant nature of RNNs, remember that we are assuming the relationship is the same everywhere; in fact you can cut the sequence anywhere in the middle and you are going to get exactly the same $W$, okay. Given that, the basic idea that people found is: instead of training on the whole sequence available to you, you split it up into many mini batches. You might recall mini batch gradient descent; this is somewhat similar. There are minor differences, but in notion, truncated back propagation through time is very, very similar.

So what do we do? Just as an example: I forward propagate through 2 steps, okay, and back propagate through 2 steps; this is one possibility. If I do that, remember that $W$ is the same everywhere, so I get some new updated $W$. Then I start at the next chunk, forward propagate through 2 steps, back propagate through 2 steps, and my $W$ is updated again, okay. Once the $W$ is updated, I move forward and continue through the rest of the sequence, okay. So I keep on doing this.

In general, what you do is forward propagate through $k_1$ steps and back prop through $k_2$ steps; a sketch is given below. There are many, many implementation details here; luckily, most packages take care of this for us. This is just for you to get a notion of how else you can train, okay, so that if you make up a new architecture, you can think of all these varieties of ideas. All you are doing is forward propagating through part of the data and back propagating through part of the data.

How does this help? If you back propagate through a small amount of data, your gradient will neither blow up nor vanish. Now what is a good rule of thumb? It is actually hard to say: for some problems a hundred steps are good, for some problems 10 or 20 steps are good, etc. This depends on the type of problem, and you will have to experiment with it like everything else. In some sense, a lot of neural networks is still engineering; it has not come to the level of science as yet. So this basic idea is called truncated back propagation through time.
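A structural sketch of the idea (not a full implementation): the forward and backward_and_update callables are hypothetical placeholders, and the sequence is processed in chunks of k steps, carrying the hidden state forward but never back propagating across a chunk boundary. Here $k_1 = k_2 = k$.

# Truncated BPTT, structurally: train on one chunk at a time.
def truncated_bptt(x_sequence, h0, k, forward, backward_and_update):
    h = h0
    for start in range(0, len(x_sequence), k):
        chunk = x_sequence[start:start + k]
        h, cache = forward(chunk, h)        # forward propagate k steps from h
        backward_and_update(cache)          # back propagate through these k steps only
        # h is carried on to the next chunk, but no gradient flows back across
        # the chunk boundary -- that is the truncation.
    return h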

A simple warning: as of right now, Keras, as I understand it, has $k_1$ equal to $k_2$, okay; that is, you have to forward propagate and back propagate through the same number of steps for truncated back propagation through time. So, as a summary: RNNs have a serious problem with training. In fact, for all neural networks and CNNs, this has been the major thing that held them back for a long time: they have a problem with training. It is not just the size of the dataset, but sometimes the gradients either explode or vanish.

This happens in everything, but it happens particularly in RNNs, because you have the same $W$ across time. Because of that, you can get a vanishing gradient very quickly or an exploding gradient very quickly. You can also have a large amount of computation, especially if your data size is large, as in the kind of examples that I showed you earlier. In the case of exploding gradients, use gradient clipping; in the case of vanishing gradients, try to use the alternate architectures which I will start on from the next video; and to handle the computational issues, try to use truncated back propagation through time. Thank you.

Machine Learning for Engineering and Science Applications
Professor Balaji Srinivasan
Department of Engineering Design
Indian Institute of Technology, Madras
RNN Architectures

(Refer Slide Time: 0:14)

Welcome back. In the last few videos we saw that vanilla RNNs can suffer from several problems, such as exploding gradients as well as vanishing gradients. What we will be doing in this video is to look at a few alternate RNN architectures; they are slightly more complicated than the plain architectures that we saw in the last few videos. Just to recollect what we have done so far: we looked at the plain, or vanilla, RNN, as it is called. It is a very simple idea: you have your input coming in, and you also have the previous layer contributing.

The input we call $x_t$, and there is also an input coming from the previous step, which is $h_{t-1}$, where $t$ is the level at which we are looking; the output, and the input to the next step, is $h_t$, and the output here is $y_t$. Now, the formulation we saw is very similar to what we have been seeing, whether with a CNN or an ANN: it is simply a linear combination, in this case with weight matrices $W$ and $U$, followed by a nonlinearity. I had also mentioned that the typical nonlinearity we use within an RNN is a $\tanh$ layer. The output is optional; you can take out the output at any point.

As we saw in the various characterizations of RNNs in the previous videos, you have an optional nonlinearity, which might be simply linear or have some nonlinearity, applied to yet another weight matrix multiplied by $h_t$; this is typically what is defined as $y_t$. So this is what we call a vanilla, or plain, RNN.

(Refer Slide Time: 02:12)

As we saw in the previous video, vanilla RNNs have trouble with either vanishing or exploding gradients during back propagation. Why is that? Because repeated operations of the sort we saw in the back propagation through time video can actually make the gradients either increase continuously or decrease continuously, depending on the eigenvalues of these matrices.

This makes it hard to train deep layers, and all this difficulty goes back to the fact that machines have finite precision. If you did not have finite precision, even a vanishing gradient would not go exactly to zero and you would still learn something; but typically, if the vanishing gradient becomes smaller than your machine cutoff, or machine epsilon, you are not going to learn anything, and your learning algorithm is basically stuck when it uses some variation of gradient descent. Similarly with the exploding gradient: it is possible that it might rise really high and then drop, but that will not happen in practice, because you will hit the maximum limit of the machine's largest numbers and it will grow unboundedly.

So you can have various problems, but in either case, for all sorts of practical algorithms, vanishing as well as exploding gradients is a huge problem, and you are unable to train deep layers. What do we mean by deep layers? Lots and lots of layers. A typical language task can use 50 to 100 layers, and even engineering tasks might require at least a few of them; in such cases you will find it really, really hard to train, and that was the situation till the 90's.

So vanilla RNNs were not used very widely for anything other than toy tasks. For an exploding gradient there is actually a method, as we saw in the previous videos: you can use gradient clipping, that is, you can set a maximum size for the gradient and make sure that no gradient crosses this threshold.

(Refer Slide Time: 04:14)

But for vanishing gradients there is usually only one solution, which is to use an alternate architecture; that is, you have to change this methodology of simply doing $g(W h_{t-1} + U x_t)$. The earliest work on this, and still a very, very popular architecture, is something called LSTM, which stands for the humorously named long short-term memory.

So "short-term" sits together with "long", and I will explain this shortly. Long short-term memory is by Hochreiter and a very famous researcher in this field, Professor Schmidhuber, in 1997. Schmidhuber has worked extensively in this area and in many other areas also; he has made phenomenal contributions to the field of machine learning. More recently there is a slightly simplified version of this called the gated recurrent unit, also called GRU for short.

We will look at both these architectures, LSTM and GRU, in this video and the coming videos, in order to see how they help in training deep layers. I want to mention that it is recommended that you look at the original papers; they are a little bit hard to read, especially the LSTM paper. We will be simplifying both the explanation and the expressions that we show here. In general, when you use RNNs, you typically use either an LSTM or a GRU; which one you use is still somewhat a matter of taste, but we will give a few suggestions in the next few videos.

(Refer Slide Time: 06:01)

So what is the basic idea? The basic idea with which LSTM started was that human beings do not simply process things sequentially. If we say something like "Balaji is taking a class; he talked about machine learning", then in order for "he" to work out, you need a long-term dependency, as we saw in the previous videos: this "he" relates to the person who is speaking, and there is a lot of material in the middle.

Now, how the brain does this is by using something called short-term memory. Within the cognitive sciences, as I understand it, there is a lot of research on how many items we can keep in mind, and the general consensus, based on a paper of a few decades ago, is that about seven items is what we can keep in our short-term memory; this is sort of our small computational RAM. So, if I give you too many items, or a long phone number, you cannot remember it, but you can remember about six or seven items in a short-term space which we can immediately access.

So what is also understood is that the brain has two sorts of mechanisms, or at least we think there are two: something called short-term memory and long-term memory. The algorithm that we have used here effectively only has a short-term memory of one step: the immediate values $h_t$ uses are simply $x_t$ and $h_{t-1}$, but it does not store the context where $h_{t-1}$ came from. This is a very rough explanation of what happens, so the basic idea here is to have a separate cell.

This is what Schmidhuber did in LSTM, which is to have a separate cell for memory, and GRU also uses some memory-like operations. Just to explain, and I will show this mathematically shortly: if you see here, $h_t$ depends on $h_{t-1}$ and $x_t$, and when I move on to the next unit, $h_{t+1}$ explicitly depends only on $h_t$; it has completely forgotten $h_{t-1}$. There is of course some portion of $h_{t-1}$ hidden in $h_t$, but $h_{t+1}$, in general, is not using $h_{t-1}$ explicitly.

(Refer Slide Time: 09:03)

So what GRU, and also LSTM, do is try to keep some portion of even prior computations stored. Some of you who have done iterative methods might know of relaxation schemes, or successive over relaxation; those also work on a similar principle, as you will see in the formula that will be shown.

(Refer Slide Time: 09:24)

So let us come to the idea of the GRU. In order to explain it, we are going to use a slightly different version, which we will call the simplified GRU; this is thanks to Professor Andrew Ng. To be clear, this is not an algorithm used in practice; it is something we use just for explanation. The actual thing used in practice is the GRU, which, as I will show you, is a slight modification of the simplified GRU architecture, okay. So let us come back to our original vanilla RNN architecture: we have $x_t$ and $h_{t-1}$ coming in and $h_t$ coming out, and we say in general that $h_t = \tanh(W h_{t-1} + U x_t)$; this is the output of the vanilla RNN.

What the simplified GRU does is relabel this: instead of calling it $h_t$, we will call it $g$, so that $g$ is equal to $\tanh$ of this term. We are going to label the argument as well; remember the linear combination is always called $z$, so we will call it $z_g$, where $z_g = W_g h_{t-1} + U_g x_t$. Remember $W$ and $U$ are matrices, and for a particular reason, instead of calling them simply $W$ and $U$, we call them $W_g$ and $U_g$.

(Refer Slide Time: 11:55)

So remember, up until now nothing has changed; this is simply the output of
the vanilla RNN. So if this is g, what is ht? Simplified GRU works in the
following way: we take a linear combination of the vanilla output and the
older computation. What does that look like? It looks like this:
ht = (1 − λ)g + λht−1.

(Refer Slide Time: 13:03)

So notice this: λ is some scalar for now; I am going to modify this shortly.
If you see this, it is simply a linear combination: g is our vanilla output
and ht−1 is my older output. How does this help? What we are saying is: I
will not simply take whatever I have computed right now, but I will also
retain some portion of whatever computation I did before. Going back to the
example that I used here, what we hope is that in ht−1 some portion of the
term "Balaji" is stored. All this is very vague, but much like CNNs, all
these algorithms were made with several heuristics. You will see that there
will be a separate cell for memory in long short-term memory, but I wanted
you to get the overall heuristic idea, which is: retain some portion of the
old computation and also add a new computation.

So when is this a nice linear combination? It is simply an interpolation when
λ is a number between 0 and 1. Now, this is not quite simplified GRU; I am
going to write that form shortly, but remember what this functions like: you
can think of λ as if it is a valve. When the valve is turned to 1, you only
retain what was done before, so ht = ht−1; it is pure memory. So when λ = 1,
you have ht = ht−1, which is pure and simple memory, no computation at all,
and when λ = 0, then ht = g, which means you get the plain vanilla RNN.

So you can think of simplified GRU effectively as a linear combination, a
sort of interpolation, between a pure memory task (just memorizing and saying
nothing new, so every word would mean the same thing in a language task) and
a pure computation task, the pure RNN. So let us look at the actual
simplified GRU expression; it is ht = (1 − f) ⊙ g + f ⊙ ht−1, where ⊙
denotes the element-wise (Hadamard) product, so this f is just like a valve.
This f is called the forget gate. Once again, remember: when f = 1 you get
pure memory, and when f = 0 you get the vanilla RNN.

Now, what is the distinction between this and the earlier expression? There,
λ was a scalar; here, f is a vector. Just for the sake of clarity, let us say
ht was a 50 × 1 vector, g was also 50 × 1 and ht−1 was also 50 × 1. Then, in
order for the Hadamard product to work, f also has to be 50 × 1. Why would
we take f to be a vector? Because you might want to remember a few things
and forget a few things within the h vector. So what this gives us is
independent valves, as I will call them, for each component of h. That is
all fine, but what value of f do we choose? Here we use the general
principle of whatever we have been doing in neural networks, which is that
we never really specify any component by hand; we let the algorithm choose it.

So this introduces parameters to be trained; it is the same as ANN weights,
the same as CNN kernel weights. Similar to those, we will assume f is full
of parameters that ought to be trained. But remember, f also has to be
between 0 and 1; why? Because only then does the linear combination work out
well, only then does it look like an interpolation.

(Refer Slide Time: 18:28)

So if I want f to be between 0 and 1, you remember from logistic regression
that we have one function that always squeezes any value into the range 0 to
1, the sigmoid, so the same principle applies here: we will keep f as a σ of
something. A σ of what? The same as before: at any unit we only have the two
vectors ht−1 and xt, and we make a linear combination of those. I will keep
another W and another U, but I want to distinguish them from the W and U I
used here, which were Wg and Ug, so I will call them Wf and Uf.

So you will see this theme repeated again and again, and I hope that if and
when you are in the position of making an algorithm of your own, you
remember that there are very simple things going into any sort of neural
network. What are the basic ideas that we have seen so far? Linear
combination and non-linearity. Here is a linear combination and here is the
non-linearity. How did we decide on the non-linearity? Because we have some
idea of what f should behave like: f should behave like a valve, so I want
it between 0 and 1, so the non-linearity I will use is the sigmoid.

So let me put it together, and hopefully it will become clear. The summary
is as follows: ht = f ⊙ ht−1 + (1 − f) ⊙ g, where g itself is effectively
the output of the vanilla RNN, g = tanh(zg), with zg = Wg ht−1 + Ug xt.
Also, f = σ(zf), where zf = Wf ht−1 + Uf xt. So this is essentially
simplified GRU.
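
Just to make the algebra concrete, here is a minimal NumPy sketch of one
simplified-GRU step. To be clear, this is our own illustrative code, not
anything from the lecture or a library; the function names, the sizes and
the random initialization are all assumptions made for the example.

    import numpy as np

    def sigmoid(z):
        # squeezes any real value into (0, 1), as needed for the forget gate
        return 1.0 / (1.0 + np.exp(-z))

    def simplified_gru_step(x_t, h_prev, Wg, Ug, Wf, Uf):
        g = np.tanh(Wg @ h_prev + Ug @ x_t)   # vanilla-RNN candidate, g = tanh(zg)
        f = sigmoid(Wf @ h_prev + Uf @ x_t)   # forget gate, f = sigma(zf)
        return f * h_prev + (1.0 - f) * g     # element-wise (Hadamard) interpolation

    # toy sizes: a 50-dimensional hidden state and a 20-dimensional input
    rng = np.random.default_rng(0)
    n_h, n_x = 50, 20
    Wg, Wf = rng.normal(size=(n_h, n_h)), rng.normal(size=(n_h, n_h))
    Ug, Uf = rng.normal(size=(n_h, n_x)), rng.normal(size=(n_h, n_x))
    h_t = simplified_gru_step(rng.normal(size=n_x), np.zeros(n_h), Wg, Ug, Wf, Uf)

Note that f * h_prev in NumPy is exactly the component-wise valve described
above: each of the 50 entries of f opens or closes its own component of h.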

Now, it is conventional to represent all these new architectures using a
figure. I am going to make a figure of my own; it might differ from the ones
that are online or in other textbooks, but this is just a pictorial
visualization of what is going on. Some people find it clearer to use this;
some prefer to use simply the formulas that I have written before. So let us
see: as before, we have ht−1 coming from the previous cell, and we have xt
as the input.

(Refer Slide Time: 22:32)

Now the two combine to give what I will call g itself: if you take tanh(zg),
you get g. From g you go a little bit further; there is our valve, which
multiplies by (1 − f). Similarly, ht−1 passes through a valve which is f; go
further, add the two together, and what comes out is ht. This is a
straightforward figure expressing all of this. Remember, even f is simply
the sigmoid of zf, where zf is a linear combination, so all expressions that
you will see within neural networks, and I have said this probably 100 times
through this course, are simply one linear combination followed by a
non-linearity. That is also true for this simplified GRU. But as I said
before, simplified GRU is not the algorithm actually used in practice.

(Refer Slide Time: 24:16)

What is used in practice is the usual gated recurrent unit; it is a small
variation of what we had in simplified GRU, so I will call it the full gated
recurrent unit. The expression for it is very similar to the one we had
before, with some small variations. This portion is the same,
ht = f ⊙ ht−1 + (1 − f) ⊙ g′, except that instead of (1 − f) ⊙ g, where g
was the vanilla RNN output, this uses a modified vanilla RNN output g′. I
will write the expressions now:
g′ = tanh(zg)
f = σ(zf)
zf = Wf ht−1 + Uf xt
zg used to be Wg ht−1 + Ug xt, and now there is a small change to that.

There is another valve here, called the remember gate or the remember valve.
Instead of simply having ht−1, you add yet another parameter there:
zg = Wg (r ⊙ ht−1) + Ug xt. r also should be between 0 and 1. Now, what do
we do about r? It should be trivial by now for you to guess: it is simply
the sigmoid of some other term zr, where zr = Wr ht−1 + Ur xt. So this is
the full gated recurrent unit. A few comments here: what is it that we have
achieved by doing this? First, we have introduced many more parameters.
Remember, in the usual vanilla RNN we only had two matrices, W and U; now
you have six matrices: Wg, Ug, Wf, Uf and Wr, Ur.
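
As a hedged sketch of these equations (our own names again, reusing the
sigmoid helper from the simplified-GRU sketch above), the full GRU step only
changes how the candidate is formed, with the remember gate r applied to
ht−1 inside the linear combination, which is the standard GRU form:

    def gru_step(x_t, h_prev, Wg, Ug, Wf, Uf, Wr, Ur):
        r = sigmoid(Wr @ h_prev + Ur @ x_t)        # remember gate, in (0, 1)
        g = np.tanh(Wg @ (r * h_prev) + Ug @ x_t)  # modified candidate g'
        f = sigmoid(Wf @ h_prev + Uf @ x_t)        # forget gate, in (0, 1)
        return f * h_prev + (1.0 - f) * g          # same interpolation as before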

So now you have increased your number of parameters by a factor of three,
and we will see in LSTM that it is a little more: when you go to LSTM it is
four times rather than three times, but for GRU it is three times, and if we
go back to simplified GRU it was two times (Wg, Ug, Wf, Uf). So the full
gated recurrent unit gives you more parameters to play with. But if it gave
you more parameters with exactly the same results, it would not be useful.
What happens is that this works much better for deeper networks. It does not
mean that vanishing gradients are no longer going to occur; it is just that
you are able to train deeper networks.

Just as a simple example: for certain cases, if you are able to do only ten
layers with a normal vanilla RNN, you could go up to maybe 70, 80, 90 layers
with the full gated recurrent unit. That is the first thing: you use up some
extra parameters, but you do get better training. The second thing is that
you have a little extra non-linearity sitting in the system, which tends to
mean that you can train richer kinds of architectures using the full GRU. As
of now, for several applications LSTM is still the first choice. LSTM is
something that we will be covering in the next video; it will be a brief
video, because most of the ingredients have already been covered here.

So LSTM is very, very similar to the full gated recurrent unit; like I said,
it will have 4 times the parameters, but even though it is older, it is
still more or less the industry standard. GRU, since it has fewer
parameters, is easier to train, but it sits somewhere in between the vanilla
RNN and LSTM. Thank you.

Machine Learning for Engineering and Science Applications
Professor Balaji Srinivasan
Department of Engineering Design
Indian Institute of Technology, Madras
Long Short Term Memory

(Refer Slide Time: 0:13)

Welcome back. In the previous video we saw some variations of the RNN
structure using GRUs, gated recurrent units. GRU of course is a recent
version, from 2014, and we saw that by introducing new weights and a
slightly more complicated structure we could probably handle the vanishing
gradient issue. GRU of course was not the first architecture to handle this;
the oldest architecture to do that was LSTM, in 1997, and it is still the
industry standard in many ways, though as you will see the structure is a
bit more complicated than GRU, but not by very much. If you got the ideas in
the previous video, the LSTM idea should also be fairly clear.

So what is LSTM? LSTM stands for, as I said in the last video, long
short-term memory; remember that "short-term" sits together. LSTM was the
first architecture to use the idea of a separate memory cell. When we were
dealing with GRU or the simplified GRU we had something like
ht = f ⊙ ht−1 + (1 − f) ⊙ g, where g was the output of the vanilla RNN, and
the idea there was to retain some portion of your old calculations in your
new ones. In the case of LSTM we will actually be using a separate cell
altogether. This is a memory cell; heuristically it is sort of like
retaining some numbers in a separate memory bank while you are calculating
with some other numbers. I will just show you the formulation.

So we write ct, and notice the analogy with what we had before:
ct = f ⊙ ct−1 + i ⊙ g, where ⊙ is once again the Hadamard, or element-wise,
product. Here i is called the input gate, and f, as last time, is called the
forget gate, so the ideas are very, very similar. Once again we want
f ∈ [0, 1], and similarly i ∈ [0, 1]. But this gives you only ct; what
happens to ht, the output that we are actually interested in?
ht = o ⊙ tanh(ct), and o is once again another valve or gate, with
o ∈ [0, 1]; it is called the output gate.

(Refer Slide Time: 04:08)

So now we have three gates, f, i, o, and we also have to compute g. Just to
write a summary of how LSTM is calculated: ht = o ⊙ tanh(ct), where c is the
memory, and ct = f ⊙ ct−1 + i ⊙ g. Now all these quantities need
definitions, and you could probably write them down intuitively even before
I do; I would recommend that you pause the video and try it once, just to
make sure that you have understood things, but I will quickly write them
down. The ideas are very similar to the ones we used in simplified GRU and
in GRU. Since o ∈ [0, 1], you say o = σ(zo), where z is a linear combination
of what came in. Similarly f = σ(zf) and i = σ(zi), and g, being the output
of a vanilla RNN, is simply tanh(zg). Now, what are these zo, zf, zi, zg? We
can write them down pretty easily:

zo = Wo ht−1 + Uo xt
zf = Wf ht−1 + Uf xt
zi = Wi ht−1 + Ui xt
zg = Wg ht−1 + Ug xt

All these put together give you LSTM.
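
Putting those equations into the same kind of illustrative NumPy sketch as
before (our own naming, with the sigmoid helper defined earlier), one LSTM
step carries two states, h and c:

    def lstm_step(x_t, h_prev, c_prev, Wo, Uo, Wf, Uf, Wi, Ui, Wg, Ug):
        o = sigmoid(Wo @ h_prev + Uo @ x_t)   # output gate
        f = sigmoid(Wf @ h_prev + Uf @ x_t)   # forget gate
        i = sigmoid(Wi @ h_prev + Ui @ x_t)   # input gate
        g = np.tanh(Wg @ h_prev + Ug @ x_t)   # vanilla-RNN-style candidate
        c_t = f * c_prev + i * g              # memory cell: keep old, write new
        h_t = o * np.tanh(c_t)                # gated read-out of the memory
        return h_t, c_t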

(Refer Slide Time: 06:51)

Now, if you look at LSTM, how many unknowns does it have? Whatever the sizes
of the W matrices, you have 8 unknown weight matrices; just for comparison,
the plain vanilla RNN has just two weight matrices.

Now, if you recall, when we were doing back propagation through time, we had
to find gradients for both of those weight matrices, including of course the
output matrix. I have not talked about that here, but if you have an output
layer you have to do back propagation for it too, though the output-matrix
back prop, as you saw in back propagation through time, was straightforward.
If you use all these eight matrices, you will have to do back propagation
for all eight of them; that is what will change. And if you use GRU, we saw
that there were six matrices, while simplified GRU had four.

So it is just a question of how much expenditure you are willing to bear.
LSTM typically can retain non-vanishing gradients for a greater number of
layers compared to GRU, and GRU typically for a greater number of layers
compared to vanilla RNN. Remember, when I say LSTM is "greater", what it
means is that the number of layers that you can train, the depth of the
architecture, can be greater with LSTM compared to GRU, and with GRU
compared to vanilla RNN. You have to balance the number of layers you can
train without vanishing gradients against the number of weights that you
have to train, plus the time taken for computation.

So LSTM will typically be more difficult to train in terms of computation
time: it will take greater time to train and also slightly more time to run,
because it has more matrices in there; everything is bigger about LSTM. Now,
a rule of thumb that is suggested, at least in modern days, meaning the last
three or four years, is: try a vanilla RNN on the task, and if a small
number of layers works well enough, good; otherwise try GRU on the task, and
if that works well enough, good; if not, then try LSTM. Of course, depending
on what people's priorities are, many people tend to use LSTM right off
the bat.

(Refer Slide Time: 09:59)

That is certainly a possibility. Now, just to repeat the exercise that we
did with GRU and simplified GRU, I will also draw a diagram for LSTM.
Remember that now we have not only ht−1 and xt coming into this box, which
is finally going to produce ht, but you also have your memory cell or memory
computation, so we also have ct−1 coming in and ct going out, and so on.

So ct progresses, ht progresses, and there is some processing that happens
inside, which was given by our formulation above. What was that? ht−1 and xt
combine as usual to give our vanilla RNN output g. Now for ct−1, if you
remember, there is a valve here; it looks like an ∞ symbol in my drawing,
but it is actually a valve. So i gets multiplied by g, the forget gate gets
multiplied by ct−1, the two combine, and this is what gives us ct as the
output. At the same time, the same ct comes down here; you run it through a
tanh, run it through the output gate, and what you get is ht.

So this is a simple schematic of LSTM. It can be drawn in very different,
more complex ways, but I like this one because it tells you what the
mathematics is doing in a simple way. You can also see several versions of
it online; each person has their own diagram of LSTM, GRU, etc. I prefer
this one; you do not really have to learn it, it is for those people who
prefer visualizations to algebraic formulas. So, LSTM is the industry
standard for RNNs; today you can more or less blindly use LSTM for any RNN
task that you see fit. A warning is that it takes a long time to train for
most of the language tasks that we are interested in. Thank you.

Machine Learning for Engineering and Science Applications
Professor Balaji Srinivasan
Department of Engineering Design
Indian Institute of Technology, Madras
Why LSTM Works

(Refer Slide Time: 0:13)

Welcome back. In the last video you saw LSTM and its architecture; we also
saw, in the video before that, simplified GRU and GRU. Remember, all of
these were meant to handle the vanishing gradients issue. Now the question
is: why is it that GRU and LSTM work? In this video we will give a very
short and very heuristic explanation. The mathematics of this, as far as I
understand, has not yet been fully worked out, so this is basically
guesswork. The initial guess behind LSTM was based more on cognition than on
any direct mathematical reason, but I will try to give you a short heuristic
of why this works.

So remember that when we had simplified GRU our expression was
ht = f ⊙ ht−1 + (1 − f) ⊙ g, where g = tanh(zg). Now, how does this
expression help the vanishing gradient issue? Remember why the gradient was
vanishing in the first place: you can sort of think of the factor
multiplying ht−1 as a weight matrix, and if this weight matrix multiplies
itself multiple times through multiple layers, its eigenvalue raised to the
power n, if it is less than one, can actually go to zero. That was the basic
problem: when the product becomes W^n, it goes like λ^n, as I explained in
the gradients video. Now, how does the gating term help? Notice that if the
f factor is thought of as W, then the (1 − f) factor can be approximated as
(I − W); again, remember, all this is very heuristic.

So if one factor is small, the other becomes correspondingly large: if f is
0.01 then (1 − f) is 0.99. So in some sense this term and that term balance
out. More importantly, the plus is what makes things work. Why does the plus
make things work? Because you can now visualize this.
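
Here is a toy numerical illustration of this heuristic; the matrices and the
gate value 0.9 below are made up purely to show the effect and do not come
from any trained network.

    import numpy as np

    W = 0.5 * np.eye(3)           # eigenvalues below 1: repeated products vanish
    v = np.ones(3)
    for n in (1, 10, 50):
        print(n, np.linalg.norm(np.linalg.matrix_power(W, n) @ v))
    # the norm decays like 0.5**n and is around 1e-15 by n = 50

    f = 0.9                       # a gate that mostly keeps the old state
    M = f * np.eye(3) + (1.0 - f) * W
    for n in (1, 10, 50):
        print(n, np.linalg.norm(np.linalg.matrix_power(M, n) @ v))
    # now the factor decays only like 0.95**n, surviving far more steps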

(Refer Slide Time: 02:55)

You might recall this from Dr. Ganapathy's videos: this is nothing but the
architecture of ResNet, and there, whether it was ResNet or AlexNet and
several other cases, you actually saw that there is an alternate pathway for
the gradient. That is, when you are doing back prop, the gradient can either
go directly through one path or it can go through the other.

Similarly, remember when we were doing LSTM, we had one pathway through ht−1
and another pathway through ct; this was the reason we drew the figure with
ct. This alternate pathway for the gradient actually helps you; again, this
is a very heuristic explanation. Whenever you provide an alternate pathway,
especially jumps from the end to the beginning, skipping a few layers one
way or the other, or you provide different paths, as was done in AlexNet
through different GPUs, it typically mitigates gradient problems.

So this is a general theme that you will see across this course, and it is a
good heuristic lesson to learn: whenever you have training problems, try to
provide an alternate pathway, some skip connections, some different way for
the gradient to flow. That is really, as we understand it, what happens even
within simplified GRU or within LSTM: mathematically, because this f sort of
balances out the (1 − f), or the i gate in the case of LSTM, it gives you
different routes to propagate the gradient, and the gradient goes longer
before vanishing. Thank you.

Machine Learning for Engineering and Science Applications
Professor Balaji Srinivasan
Department of Engineering Design
Indian Institute of Technology, Madras
Deep RNNs & Bi-RNNs

(Refer Slide Time: 00:15)

Welcome back. In this video we will be looking, again at a very brief,
overview level, at deep RNNs as well as bi-directional RNNs. Deep RNNs are
particularly important in language; Google Translate, for example, uses deep
RNNs at a certain level, so let me write that down. The Google Translate
that you will see if you go to translate.google.com, we know, because Google
has published a paper, at some level uses deep RNNs. Now, what are deep
RNNs? Let us look at just one of these units: within the RNN it is just an
ANN, as we saw with normal RNNs. In a normal RNN all you had was one input
layer, one hidden layer and one output layer. In a deep RNN, all that
happens is that the single layer of the RNN becomes a deep neural network;
that is the only difference between a deep RNN and a normal RNN.

Now, each of these could by themselves be LSTMs, etc. We are not going to
discuss that; let us just assume that each of these is a deep network. All
that happens here is that there is a connection between each layer at each
time step. Remember, in one direction we have time and in the other
direction we have layers; the number of layers decides how deep it is.
Sometimes, by abuse of notation, even I have called the time direction
depth, but that direction is not really depth; depth is actually along the
layers, and the other direction is time or sequence. So you can think of
this as a single deep neural network unrolled: that is all it is, unrolled,
the same structure repeated again and again, and of course the weights are
always the same across time. Now, what is the big deal about deep RNNs?

Obviously, they let you deal with more complex structure. Now, if I look at
some arbitrary unit: let us say this represents level one, this represents
level two and this represents level three; this of course is ŷ, and these
are time units 1, 2, 3 and 4. If I look at some element, let us draw it:
this is a hidden unit h; let us denote the time sequence as a subscript t,
and give the level by a superscript, which here is 2, or in general l.

So in this specific case this element would be h_3^2. Now, what comes in?
The previous one: as you can see, time decreases in that direction, so what
comes in from the side is h_{t-1}^l, at the same level, and what comes in
from below, instead of x (which is what used to happen in a normal
single-hidden-layer RNN), is h_t^{l-1}. So in this case the general
expression would be h_t^l = tanh(W h_{t-1}^l + U h_t^{l-1}). Now there is
one small catch, of course, with each level.

This level will have W^1, U^1, this one will have W^2, U^2, and this one
W^3, U^3; that is the difference here. In general you have W^l, U^l. So
instead of one single W and U, which is what we used for a normal RNN, you
will have multiple Ws and multiple Us; that is the only difference between a
deep RNN and a usual RNN. The other thing you can see is that when you do
back propagation through time, or any back propagation, it can get fairly
complex, because there are multiple gradient pathways to go from one unit to
another: you could go one way or you could go the other.

You have all sorts of paths; this is basically what TensorFlow and similar
frameworks make a little easy: they draw a graph of the dependencies of each
quantity on everything else and automatically calculate the gradient for
you. So deep RNNs are extremely useful, as I said in the beginning,
especially in language tasks, but above and beyond what we have already
discussed, other than computational complexity there is no real notional
complexity beyond having multiple Ws and multiple Us.
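
A hedged sketch of this deep-RNN forward pass (our own illustrative code:
Ws[l] and Us[l] play the roles of W^l and U^l, where U^1 maps from the raw
input and the higher U^l map from the level below):

    import numpy as np

    def deep_rnn_forward(xs, Ws, Us):
        L, n_h = len(Ws), Ws[0].shape[0]
        hs = [np.zeros(n_h) for _ in range(L)]   # h_0^l = 0 at every level
        for x_t in xs:                           # march along the time axis
            below = x_t                          # level 1 sees the raw input x_t
            for l in range(L):                   # march up the depth axis
                # h_t^l = tanh(W^l h_{t-1}^l + U^l h_t^{l-1})
                hs[l] = np.tanh(Ws[l] @ hs[l] + Us[l] @ below)
                below = hs[l]
        return hs[-1]                            # top-level state after the last step

    rng = np.random.default_rng(0)
    n_x, n_h, L, T = 4, 8, 3, 5
    Ws = [rng.normal(scale=0.1, size=(n_h, n_h)) for _ in range(L)]
    Us = [rng.normal(scale=0.1, size=(n_h, n_x if l == 0 else n_h)) for l in range(L)]
    print(deep_rnn_forward([rng.normal(size=n_x) for _ in range(T)], Ws, Us))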

(Refer Slide Time: 06:05)

The second thing that we look at is something called bi-directional RNNs;
sometimes you might even find the term Bi-LSTM, which is simply a
bi-directional RNN using LSTM rather than the usual RNN unit. These might be
deep or not deep; it really does not matter. Now, what are these for? These
are for tasks that are sequential but where the sequence can be read both
backward and forward, that is, not only does the future depend on the past,
so to speak, but the past also depends on the future. What would be an
example of that? Let me give you a very simple example, though you can think
of several things even in engineering problems; I will come back to that.
Suppose I write something of this sort by hand, and you have an optical
character recognition tool; just like we saw with MNIST, where you want to
recognize a handwritten digit.

Similarly, here you want to recognize what this word is. The way usual RNNs
would do it is: this image will be input one, this image input two, this one
input three and this one input four. Now suppose I go only in one direction:
it will read S, it will read O, and it will not know whether this letter is
T or whether it is F; the probability will not be resolved, because you are
only seeing that particular letter and the past letters. However, as a human
being, if you see and clearly identify a later letter as T, you can actually
go back and correct yourself. In fact, if I am not mistaken, the Microsoft
equation editor uses this: there is now an option in the Microsoft equation
editor called ink, where you can write things by hand, and it will actually
go back and correct what it read before.

So I actually read both backward and forward; this is usually how we read
even with our eyes: we sort of guess what the middle letters are based on
what happens at the end. In such cases you will need a bi-directional
reading: you go one way and read it, you go the other way and read it, and
the joining of these two is what tells you what each letter is. It is not
only what happens in one sequence direction or the other.

You can in fact see this even in the sensor problem that I told you about:
suppose you want to guess whether a person is sitting or not, and you are
right at the beginning of a signal. If you go back and look at that video
you will see signals like this, but if I am right at the beginning of the
signal, how do I figure out what it is a part of? At the beginning you
actually have to guess from what happens in the future to see what the
meaning of the first term is. So this is also an example; even though we did
not really use Bi-LSTMs there, typically this is a good use case too. Now,
how do we actually do bi-directional RNNs? It is a small tweak over the
usual RNN.

(Refer Slide Time: 09:22)

So let us look at a figure; I am going to reduce our boxes to circles here.
Once again, let us say this is x1, this is h1 and this is ŷ1. Usually we
would go in the forward direction and write something like
ht = tanh(W ht−1 + U xt + b). What we do when we are doing bi-directional
RNNs or LSTMs is add an additional set of hidden vectors. Remember,
x1, x2, x3 are fixed; these are simply our inputs. If you look at the word
"soft", these would be S, O, F, and x4 would be T. But I have a choice on
what I can do for the hidden units.

So not only do I have a forward hidden vector, I also add a reverse hidden
vector, drawn with the opposite arrow. The reverse h2 will not depend on h1;
it will actually depend on the reverse h3. So I will add a new set of
weights: call them W forward and W reverse (even though the Ws are matrices,
I am marking them with arrows just so you can see the direction). The
reverse update is h←t = tanh(W← h←t+1 + U← xt + b←): you add another W,
another U and another bias. So you have added three new parameter sets; this
is just like what we did with LSTMs: you have forward parameters and you
have reverse parameters. Now, what about ŷ? If I look at ŷt, it will take an
input not only from the forward unit but also from the reverse unit, so ŷ,
which usually used to be simply some non-linearity of V times ht,
now becomes ŷt = g(V→ h→t + V← h←t + c). So now you see: the forward
direction has W→, U→ and a bias, the reverse direction has W←, U← and
another bias, which is six parameter sets, and together with V→, V← and the
bias c that makes nine in total. Both in reading documents and in processing
speech (sometimes you can figure out what I am saying only after I say a few
more words), and in translation too, the task is actually bi-directional:
you cannot translate a full sentence until you know the full sense of the
sentence.
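
A minimal sketch of this bi-directional computation, with our own variable
names (the _f suffix marks the forward parameters and _b the reverse ones);
the non-linearity g, for example a softmax, would be applied on top of the
returned values:

    import numpy as np

    def bidirectional_rnn(xs, W_f, U_f, b_f, W_b, U_b, b_b, V_f, V_b, c):
        T, n_h = len(xs), W_f.shape[0]
        h_fwd, h_bwd = [None] * T, [None] * T
        h = np.zeros(n_h)
        for t in range(T):                    # left-to-right: h_t uses h_{t-1}
            h = np.tanh(W_f @ h + U_f @ xs[t] + b_f)
            h_fwd[t] = h
        h = np.zeros(n_h)
        for t in reversed(range(T)):          # right-to-left: h_t uses h_{t+1}
            h = np.tanh(W_b @ h + U_b @ xs[t] + b_b)
            h_bwd[t] = h
        # each prediction reads the hidden states from both directions
        return [V_f @ h_fwd[t] + V_b @ h_bwd[t] + c for t in range(T)]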

So you want to go forward as well as backward; in such cases bi-directional
RNNs are useful. Again, over and beyond what we have said, most of the rest
is simply unrolling the graph and doing back prop; otherwise there is no
other difference from what we have done so far. So in this video we looked
at both deep RNNs and bi-directional RNNs; these are just small tweaks, and
which one you use depends on the particular task you want to put them to.
Both are alternate sorts of architectures for RNNs. Thank you.

Machine Learning for Engineering and Science Applications
Professor Balaji Srinivasan
Department of Engineering Design
Indian Institute of Technology, Madras
Summary of RNNs

(Refer Slide Time: 00:13)

Welcome back. This video summarizes what we have learnt this week about
RNNs. RNNs are recurrent neural networks; they are basically used for
variable-size sequential data. In engineering problems, whenever you have
time series, and especially equally spaced temporal data, RNNs can be
tremendously useful; at the very least, they are very powerful. We are still
discovering use cases: there is a whole separate body of techniques for
classical time series analysis, and there are in fact some recent papers
that argue that classical time series analysis is actually much more
powerful than, say, LSTM or usual RNNs for engineering problems.

So this is still something that is being worked out, but typically this
would be the best-case scenario: if you have equally spaced time series
data, RNNs can sometimes be extremely powerful. So whenever you have
variable-size sequential data, you use RNNs. The basic structure of an RNN
is simply an ANN repeated, which is why you will sometimes see the figure of
x and h looping on itself; it basically means that if you unroll it, you
have x, h, x, h going sequentially again and again. If the network is not
deep, we use only one set of W, U, V: W denotes the connection with the
previous h, U denotes the connection with x, and V denotes the connection
with the final predictive layer.

We saw that we have gradient problems; these can be either vanishing
gradients or exploding gradients. For vanishing gradients we use alternate
architectures, specifically GRU and LSTM, and for exploding gradients we use
gradient clipping. We also saw that, because the U, V and W are the same
across time, we use back propagation through time, and sometimes it can get
very expensive because you are back propagating through the whole sequence,
which is why we sometimes do truncated back propagation through time; in
fact, when you have a large number of sequential steps, you use truncated
back propagation through time. Finally, we also saw slightly more
sophisticated versions of the same thing: using LSTM or otherwise, you can
either use deep RNNs or bi-directional RNNs.
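
Since gradient clipping is only mentioned in passing here, this is a hedged
sketch of the common norm-clipping form; the threshold 5.0 is an arbitrary
illustrative choice, not a recommendation from the lecture.

    import numpy as np

    def clip_gradient(grad, max_norm=5.0):
        # if the gradient's norm exceeds the threshold, rescale it down to the
        # threshold; the direction is kept, only the magnitude is capped
        norm = np.linalg.norm(grad)
        if norm > max_norm:
            grad = grad * (max_norm / norm)
        return grad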

Now, this sequence of techniques can be applied to several problems, and
they are being applied to several problems, especially language tasks.
Within engineering and science this is still something that is in
development: CNNs and ANNs have already got mature uses in engineering
problems, while RNNs are still developing as far as engineering problems are
concerned; we have not found too many uses yet. We will show you one more
application in week 10 of this course. Thank you.

Machine Learning for Engineering and Science Applications
Doctor Ganapathy Krishnamurthi
Department of Engineering Design
Indian Institute of Technology, Madras
Introduction Week 09

Hello, and welcome back. We will start off week nine with a small
introduction to some of the topics we will be looking at.

(Refer Slide Time: 00:19)

So far we have been looking at deep neural networks, or deep learning
algorithms, starting with plain artificial neural networks. What we will
look at this week is a collection of very powerful algorithms which were in
wide use prior to what you might call the deep learning era, which is the
last 5 to 8 years.

These algorithms are still very powerful and retain many of their
advantages. We will look at the following algorithms: K nearest neighbours
for supervised classification, binary decision trees, binary regression
trees, bagging, and random forests. These sets of algorithms are related,
because most of them use binary decision trees or binary regression trees in
some form.

So we look at them as one single block or module, though of course there
will be separate lecture videos. We will also look at some unsupervised
learning techniques, which include K-means and agglomerative clustering.
These are very powerful techniques; many implementations of them are
available in platforms like Python and R, and you are welcome to experiment
with them. They are still used quite a bit in data analytics. As for their
performance: prior to the success of deep neural networks, say AlexNet,
random forests were really one of the most powerful tools used in medical
image analysis and many other data analytics tasks, and even now they are
being used in many cases; they are some of the best techniques to turn to.
They are well studied, and lots of extremely optimized implementations of
many of these algorithms are freely available for use in your applications.

So we will move on to the lectures. I have given a particular order here,
and we will try to stick to that order in the lectures as well. Thanks.

Machine Learning for Engineering and Science Applications
Doctor Ganapathy Krishnamurthi
Department of Engineering Design
Indian Institute of Technology, Madras
K Nearest Neighbours (KNN)

Hello and welcome back. In this video we will look at K Nearest Neighbours,
one of the simpler classification algorithms. All the figures, graphs and
illustrations in the slides that we are going to see are provided by Intel
Software.

(Refer Slide Time: 0:30)

So, let us look at this dataset; it is a cancer dataset with two features.
One feature is the number of malignant nodes, the cancerous lumps or lesions
seen in the patient, perhaps in the images. The other feature is the age of
the patient, and what we are trying to predict is the survival of the
patient: did the patient survive or not? So this is one piece of the dataset.

All the red points correspond to patients who did not survive, and all the
blue spheres correspond to patients who survived. What we want to do is to
predict, when new patient data comes in, so we have the age of the patient
and the number of malignant nodes for that patient, and let us say it falls
here in the dataset, whether the patient will survive or not.

So, how does the K-Nearest Neighbours algorithm go about doing that? It has
one hyperparameter, if you can call it that: the neighbourhood count, called
K, which is why it is called K-Nearest Neighbours. What we do is look at the
neighbourhood of the point where this test data is located, and let us say
we consider only one neighbour, in the sense that we consider the nearest
neighbour.

We will see later on what we mean by nearest neighbour, but assume that we
have some way of figuring out the nearest neighbour. In this case the
nearest neighbour is a red sphere, which represents a patient who did not
survive, so we classify this patient as not having survived. This is when
K = 1.

We can also see that when K = 2 we have two points which are closest to it
in some sense, and one of them is blue and the other is red, so it is a tie;
this is difficult to classify that way. Then consider, say, the three
nearest neighbours, where two out of the three say the patient will survive
and one says the patient will not. If we take the majority vote, we can then
say that the new patient data indicates that the patient will survive.

We can of course keep increasing the number of neighbours in this fashion,
and in this case, when we consider the 4 nearest neighbours, we see that
three out of the four points correspond to patients who survived, so we can
at least predict that the patient will survive. So this is the basic
algorithm. What we have are two choices: one is the K value, which tells you
how many points to consider. When we say points to consider, we are only
looking at the training data points.

So we are given a dataset which we consider the training dataset, and when a
new test data point comes in, when we say distance we mean only the distance
to points present in the training data. So it is not a continuum; it only
depends on how many training points you have.

(Refer Slide Time: 3:48)

So, in that sense, there are only two design choices here: one is K and the
other is the distance metric. We will see how to choose K and what kinds of
distance metrics are typically used.

(Refer Slide Time: 4:03)

We will consider the two extreme cases here: one where we consider only one
neighbour at a time, and the other where we consider all the neighbours,
where "all" means the entire training dataset, not a continuum as the graph
might imply. Now, if we consider only one neighbour at a time, look at this
particular plot: there is a bunch of red points here, corresponding to
patients who did not survive, and a bunch of blue points corresponding to
patients who survived, with the odd blue and red data point mixed in. What
does this curve imply? This is the decision boundary; whether you look at it
as the red curve or the blue curve, it is a decision boundary, which means
that all test points that fall on one side of the boundary are classified as
"the patient will not survive", and all test points that fall on the other
side are classified as "the patient will survive".

We draw this boundary by looking at the distance to the first nearest
neighbour. This says that if a test data point lands very close to the red
boundary, on that side, it corresponds to a patient who will not survive,
and as we move to the other side, the corresponding data points are patients
who will survive.

We can actually change this particular decision boundary by changing the
number of nearest neighbours. So, what happens if we consider all the data
points? In the sense that, instead of considering one neighbour, we
consider, say, all of the roughly 30-odd points in this dataset as
neighbours, and then see.

If we do that, then, by virtue of there being more patients in the dataset
who survived than who did not, any data point you throw in will give the
result that the patient will survive. So K is the parameter that we will
have to learn to tune. These are the extreme cases: for K equal to 1 we
consider only one neighbour at a time, and again, remember, when I say
neighbour we are only talking about the training data points that are made
available to you.

So now the question is: how do we choose this K? That is solved by splitting
your data into training, validation and testing sets, maybe in the ratio
70%, 20% and 10% of your data. Based on the performance on the validation
data for a particular choice of K, you can then decide on that K. For
instance, given your split of the dataset, you can vary the value of K,
maybe from 1 up to 10 or 100 depending on the size of your dataset, and find
the K value for which you get very good performance on your validation
dataset; then of course you can go ahead and do the testing.

The other technique would be to try n-fold cross-validation: here you choose
different splits of your data into training, testing and validation (or just
training and testing), then vary K for each of these splits, and look for
the K value for which you get low variance and reasonably good accuracy.
This is your standard way of making sure that you do not overfit; it is very
easy to overfit with K nearest neighbours, as we saw: with K equal to 1 you
can get a very sharp boundary, but that means you will be overfitting.

So, by splitting your data into training, validation and testing, you can
figure out the value of K using your validation dataset, or by doing n-fold
cross-validation, trying different values of K for every fold and finding
the value of K for which you get low variance and reasonably high accuracy.
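
As a sketch of this selection procedure, assuming scikit-learn is available
(the candidate range 1 to 30 and the 5 folds are illustrative choices of
ours, not values from the lecture):

    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    def choose_k(X, y, candidate_ks=range(1, 31), folds=5):
        # mean cross-validated accuracy for each candidate K
        scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k),
                                     X, y, cv=folds).mean()
                  for k in candidate_ks}
        return max(scores, key=scores.get), scores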

This is of course a very simple method in the sense that you do not actually
do any training: you just load all the data into memory, which is of course
a problem when the data size becomes very large. You consider all the
training data points at the same time, after splitting into training,
testing and validation, and then you just find the nearest neighbours.

(Refer Slide Time: 8:39)

One more thing to clarify is what we mean by nearest neighbour. Typically we
use the Euclidean distance. In this case we have the two features, number of
malignant nodes and age. If we want the distance between the test data point
and one of the training data points, we calculate the difference between the
feature values, square them, add them, and take the square root. Of course,
remember that when we do this we have to make sure that we do data
normalization; this goes without saying, because the range of the
number-of-malignant-nodes axis is going to be different from the range of
the age axis.

So it makes sense to normalize the features to make the distances more
meaningful. You can use the Euclidean distance provided that you have done
the data normalization, in which case the Euclidean distance makes sense to
a good extent.

(Refer Slide Time: 9:37)

That was the L2 distance. We can also consider the L1 distance, which is
just the sum of the absolute values of the differences between the feature
values. Once again, we expect that data normalization has been done before
we compute these metrics.
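
Putting the whole procedure together, here is a small from-scratch sketch
(our own illustrative code, not a library routine) that normalizes with the
training statistics, computes either distance, and takes the majority vote:

    import numpy as np

    def knn_predict(X_train, y_train, x_test, k=3, metric="l2"):
        # normalize every feature using training-set statistics, so that,
        # e.g., age and number of malignant nodes are on comparable scales
        mu, sigma = X_train.mean(axis=0), X_train.std(axis=0)
        Xn, xn = (X_train - mu) / sigma, (x_test - mu) / sigma
        if metric == "l2":
            d = np.sqrt(((Xn - xn) ** 2).sum(axis=1))   # Euclidean distance
        else:
            d = np.abs(Xn - xn).sum(axis=1)             # L1 distance
        nearest = np.argsort(d)[:k]                     # the K nearest neighbours
        labels, counts = np.unique(y_train[nearest], return_counts=True)
        return labels[np.argmax(counts)]                # majority vote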

(Refer Slide Time: 9:56)

We can also do multi-class classification using K-Nearest Neighbours. It is
the same procedure, and again there is the possibility of ties, so you may
have to vary K to the point where ties do not happen that often. To
summarise: we have looked at the K-Nearest Neighbours algorithm, one of the
simpler algorithms, and in many cases it will work very well, depending on
your data. The basic principle is that you load the entire dataset along
with its features, of course after splitting it into training, validation
and testing.

This is a supervised technique, so you actually know the ground truth. For a
new incoming data point, you decide how many neighbours to consider, find
the K nearest neighbours, and do a majority vote among them to find the
class to which your test data point belongs. We will look at other
classification algorithms in machine learning in the next few lectures.
Thank you.

Machine Learning for Engineering and Science Applications
Doctor Ganapathy Krishnamurthi
Department of Engineering Design
Indian Institute of Technology, Madras
Binary Decision trees

Hello and welcome back. In this video we will look at binary decision trees,
an introduction to binary classification trees.

(Refer Slide Time: 0:23)

We will look at this dataset, in which each row is a data point. It was
collected over 14 days, so there are 14 data points. There are 4 features,
as you can see here: outlook, temperature, humidity and wind. You can
consider this our training data, and based on these 4 features and the data
points provided, a decision has to be made regarding whether to play tennis
or not.

You can see that the Yes decisions are marked in red and the No decisions
are marked in black. So we have 4 features and 14 data points, and based on
these features we have to decide whether to play tennis or not. How we go
about doing that using decision trees is what we are going to look at in
this video.

The idea, to reiterate, is to make a decision as to whether to play tennis
or not based on the features provided. The way you go about doing that is to
split the entire data based on the features provided, eventually leading to
a decision as to whether to play tennis. So we start at a node, this is what
is called a node, which contains the whole dataset.

Then we do a check: is the temperature greater than or equal to mild? Recall
that hot, mild and cool are the 3 categorical values that the temperature
can take; go back and check the table. If the temperature is greater than or
equal to mild, we decide to play tennis; otherwise we decide no. So that is
the idea.

And we do not have to stop with one question; we can go ahead and split
further. For example, we can check whether the humidity is normal or not: if
the humidity is normal then we play tennis, otherwise we do not. Now
remember what happens at each node: we start at this node with all the data,
and as we split the data based on the temperature condition, we take all the
data points in the dataset where the temperature is greater than or equal to
mild and assign them to this node.

And all the data points where the temperature is cool we assign to the other
node; this is what we want to do. From there on, among the data points
assigned to a node, we check if the humidity is normal or not: if yes, we
assign them to this node here, otherwise, if the humidity is not normal, we
go to the other one.

So this is the root node, right here, the root of the tree that you have
created, and the nodes at the bottom are the leaves. The idea is that, as
you go down this tree (or up it, depending on how you draw it), the leaves
are expected to accumulate data with similar decisions: we expect all the
data where we decide to play tennis to accumulate in one node, and all the
data where we decide not to play tennis to accumulate in the other node. So
that is the idea behind building this tree. How do we go about doing that?

(Refer Slide Time: 3:54)

So, just to summarize: we start with a feature in the dataset and make a
split based on that feature. The feature could be continuous, binary or
categorical; the split is based on some condition. We will see what that
condition is later on, but for the next few slides we will assume that we
have a way of figuring out which feature to choose and how to make the
split.

So we choose a feature and split the data in two, leading to 2 different
nodes, and we assign the split data to each of these nodes.

(Refer Slide Time: 4:35)

And we continue splitting the data based on the features: we can choose
another feature here, again assuming that there is a way to choose a
particular feature, and make two further splits.

(Refer Slide Time: 4:53)

We can keep doing that, we can keep splitting, until we come to a point
where all the leaf nodes are pure, in the sense that they contain only one
class, so all the data in those leaf nodes belongs to a single class, or
until we reach a maximum depth; for example, we may decide to only go up to
a depth of 3 or 4.

So then we stop after a depth of 3, or we can evaluate a performance metric
based on which we stop the tree from growing further. That is how we build
the tree using the training data, but how do we test it? When test data
arrives, we pass it through the tree; it takes a path through the tree and
ends up in one of the leaf nodes, each of which corresponds to a class, and
that class is assigned to the test data point. So that is how we would go
about building and using a binary decision tree.
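
As a sketch of this train-then-classify workflow, assuming scikit-learn is
available; the integer encoding and the tiny table below are made up for
illustration and are not the actual 14-day dataset:

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    # columns: outlook, temperature, humidity, wind, each encoded as integers
    X = np.array([[0, 2, 1, 0], [0, 2, 1, 1], [1, 2, 1, 0], [2, 1, 1, 0],
                  [2, 0, 0, 0], [2, 0, 0, 1], [1, 0, 0, 1], [0, 1, 1, 0]])
    y = np.array([0, 0, 1, 1, 1, 0, 1, 0])     # 1 = play tennis, 0 = do not

    tree = DecisionTreeClassifier(criterion="entropy", max_depth=3)
    tree.fit(X, y)                              # build the tree on training data
    print(tree.predict([[2, 1, 0, 0]]))         # route a new day down the tree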

(Refer Slide Time: 6:01)

So then here comes the question: how do we make the split? Usually the split
is based on something called information gain; we will look at what that is.
Typically the two criteria used for information gain are the entropy and the
Gini index, which together are what we would call the information-gain
criteria.

We can also do the split based on the classification error. We will look at
some examples of how we would go about doing that.

(Refer Slide Time: 6:39)

Let us say we are at a particular node where we have 8 data points for which
the outcome is Yes, that is, to play tennis, and 4 data points where the
outcome is No, not to play tennis. Then we decide to make a split based on
whether the temperature is greater than or equal to mild, which results in
one child node with 6 data points for Yes and 2 for No, and another child
node equally distributed, with 2 for Yes and 2 for No.

We still have not decided how to make the split and which feature to choose,
but we will see as we go on how to do that. This particular example is just
to illustrate why using the classification error is a bad idea. This is the
equation for the classification error: E(t) = 1 − max_c p(c), where c
denotes the class. The number of classes can be more than 2: typically for a
binary classification there are 2 classes, but there could be multiple
classes for this kind of algorithm. So 1 − max_c p(c) is what we would like
to evaluate as the classification error. Let us see: if we look at this
particular node here,

we have 1 − max[8/12, 4/12], which leads to 1 − 8/12 as the classification
error. Here p(c) is simply the proportion of data belonging to class c. So
1 − max_c p(c) means that, in this case with 2 classes, there are 2 numbers
here; if there were N classes there would be N numbers, and we choose the
maximum of those, and 1 minus that gives you the classification error, in
this case 0.33. So then the question is: how do we actually decide on the
feature, and how do we actually do the split? We will now see how that is
calculated.

(Refer Slide Time: 9:09)

What we have to do then is look at the left and the right node. We are
looking at a binary decision tree, so there is only a 2-way split. If you
look at the left node, the classification error, again using the formula
1 − max_c p(c), gives you 1 − max[2/4, 2/4], which is 0.5.

(Refer Slide Time: 9:41)

991
And if you look at the classification error on the right-hand side, it is
1 − max[6/8, 2/8], so you get 0.25. What we typically want to look at is the
change in the classification error: we want the split that gives rise to the
biggest decrease in classification error. So, after the split we calculate
the weighted average of the classification error.

Here the weighted average is over the 2 nodes after the split: you can call
this the parent node and these the 2 child nodes, and you have to calculate
the weighted average over the two children. How do we do that? We look at
the change in classification error; this 0.33 is the classification error
before the split.

Now we compute it after the split. The weighted average in this case: 4 out
of 12 data points went to the left node, so that weight is 4/12 times the
classification error on the left node, and the weighting for the right node
is 8/12, that is, 8 data points out of 12, times 0.25. If we do that, we get
(4/12)(0.5) + (8/12)(0.25) = 0.33, and we see that the change is zero.
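
This calculation is easy to verify numerically; a small sketch (our own
code) using the class counts from the example:

    import numpy as np

    def classification_error(counts):
        # counts: number of training points of each class at a node
        p = np.array(counts, dtype=float) / sum(counts)
        return 1.0 - p.max()

    parent = classification_error([8, 4])          # 1 - 8/12 = 0.333...
    left = classification_error([2, 2])            # 0.5
    right = classification_error([6, 2])           # 0.25
    after = (4 / 12) * left + (8 / 12) * right     # weighted child error
    print(parent - after)                          # essentially 0: no apparent gain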

So this can happen when we use the classification error as our metric to
decide the split. The general idea is to consider one feature at a time,
split the data based on that feature, and calculate the change in
classification error for each of the features.

Then we choose the feature that gives rise to the maximum change in the
classification error. However, it is possible that when we use the
classification error as a metric, the change can sometimes be zero, as we
just saw, so it turns out to be a bad choice; we will see why that is so in
a later slide. To get over this, there are 2 other metrics that we typically
use: the Gini index and the entropy.

(Refer Slide Time: 12:29)

So we will look at entropy first. This is the formula, which must be
familiar to most of you: H(t) = −Σ_c p(c) log2[p(c)] (note the minus sign,
which makes the entropy non-negative). What we want to do is calculate the
entropy before the split based on the feature, calculate the weighted
entropy after the split based on the same feature, and see what the change is.

(Refer Slide Time: 12:58)

Let us see how that is done. The entropy before the split: there are 2
classes, Yes and No, and you have to sum over the classes. Here p(c) is just
the proportion of data points corresponding to class c. 8 data points out of
12 correspond to Yes, so that term of the entropy is −(8/12) log2(8/12), and
the proportion of data points corresponding to the No class is 4 out of 12.

So its term is −(4/12) log2(4/12), and together that gives 0.91. Then we
calculate the entropy on the left node, which in this case is
−(2/4) log2(2/4) − (2/4) log2(2/4), that is, 1. And for the entropy on the
right-hand side: there are 8 data points, 6 of which correspond to Yes and 2
to No.

So we get −(6/8) log2(6/8) − (2/8) log2(2/8). In each case there is a
summation over the classes of −p(c) log2[p(c)]; if there are N classes there
will be N terms in that sum.

(Refer Slide Time: 14:13)

So we can do the same thing and calculate the entropy change: $0.918 - [\frac{4}{12} \times 1 + \frac{8}{12} \times 0.811] \approx 0.0441$. So this is for one feature, the temperature. We would do the same calculation for the other 3 features, calculate the same entropy change for all of them, and choose the feature that gives rise to the maximum change. There are some questions here because, in this particular example, all the features I have chosen are categorical: they take values from a finite set. So we can split based on any of those values; for instance, we can say less than hot, and things like that. So to arrive at that answer, even for a particular feature, we might have to evaluate this entropy change for different values of that feature.

And if you have a continuous feature, lets say you just have temperature measured in Fahrenheit or Celsius, then we have to set the right threshold on that temperature, the one which gives rise to the largest entropy change, okay, and then we make the split based on that; there are different ways of calculating that, alright.

So in our case we have nice categorical labels and we can always choose among them to see which gives rise to the maximum entropy change, okay. So this is typically how the split based on entropy works.
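For the continuous case just described, here is a small sketch of the threshold search (again my own illustration; the temperatures and labels are made up):

import numpy as np

def entropy_of_labels(y):
    # H = -sum_c p(c) log2 p(c) over the labels in y
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_threshold(x, y):
    # scan candidate thresholds on feature x; return the one with the
    # largest entropy change (information gain)
    parent = entropy_of_labels(y)
    best_t, best_gain = None, -np.inf
    for t in np.unique(x)[:-1]:          # keeps the right side non-empty
        left, right = y[x <= t], y[x > t]
        w_l, w_r = len(left) / len(y), len(right) / len(y)
        gain = parent - w_l * entropy_of_labels(left) - w_r * entropy_of_labels(right)
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain

x = np.array([18, 21, 24, 27, 30, 33])   # hypothetical temperatures
y = np.array([1, 1, 1, 0, 0, 0])         # hypothetical Yes/No labels
print(best_threshold(x, y))              # splits cleanly at 24, gain = 1.0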

(Refer Slide Time: 15:54)

So lets see why we had this problem with the classification error, wherein we saw that there was no change, okay. That happens because, lets say your parent node had some error value here, okay, and the child nodes ended up having errors there and there, okay; this is a possibility. And we see that the weighted average of the children can end up coinciding exactly with the error of the parent node, which means there will be no change and the tree wont be able to grow any further. However, compare your entropy formula, which is $-\sum_c p\log_2 p$ (this is the entropy-based split, okay), and we also have the Gini index, $1 - \sum_c p^2$.

Now remember the summation is over the classes, okay. So the maximum value of the Gini index, for a 2-class problem, is 0.5; you can verify that. Same thing for the classification error: the maximum value of the classification error, I will call this CE, is also 0.5. For the entropy, for a 2-class problem, the maximum entropy is 1, okay.

So if you plot all of them against p: the classification error gives a linear, triangular curve; the Gini index, $1 - \sum_c p^2$, is a much smoother, strictly concave curve than I drew here; and the entropy looks similar but with maximum value 1. They all reach their maximum value at p equal to 0.5 for a two-class problem, okay.

So the maximum entropy happens when, in a particular node, all the classes are equally distributed, okay. That is what gives rise to maximum entropy, and for a 2-class problem you can see that this happens when p becomes 0.5, okay.

So there is an easy way to figure out where the maximum happens. For instance, remember the entropy for a 2-class problem; writing the proportion of one class as p, it is $E = -p\log_2 p - (1-p)\log_2(1-p)$, okay. Think of p as a variable; to find the maximum value you just set $\frac{dE}{dp} = 0$, and if you do that you will figure out that it happens when p equals 0.5, okay.

You can do the same thing for the Gini index of a 2-class problem, $G = 1 - p^2 - (1-p)^2$: once again set $\frac{dG}{dp} = 0$ and you will get p equal to 0.5, okay. So that is very easy to calculate, but the point is that if we use the classification error, because it has this linear trend, it is possible to end up with a total change in classification error of zero, and then there will be no way to grow the tree any further, okay.
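For the record, the two-class calculation written out (standard calculus, not from the slide):

$E(p) = -p\log_2 p - (1-p)\log_2(1-p), \qquad \frac{dE}{dp} = \log_2\frac{1-p}{p} = 0 \;\Rightarrow\; p = \frac{1}{2}$

$G(p) = 1 - p^2 - (1-p)^2, \qquad \frac{dG}{dp} = 2 - 4p = 0 \;\Rightarrow\; p = \frac{1}{2}$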

But if you use the other 2, entropy and the Gini index, then you can successfully split on the different features. If you use entropy and entropy-based information gain, it is called the ID3 algorithm, and if you use the Gini index it is called CART. So this corresponds to ID3 and this one is CART; there are different algorithms for doing this.

One way of looking at this is that it is a greedy search, alright. At a particular node you are looking for the best split at that node; instead of considering the entire tree as a whole, you just split based on the best criterion available at that time, so it is like a greedy algorithm.

(Refer Slide Time: 21:08)

And to summarize: you grow the tree until a stopping condition is met (we looked at different stopping conditions, one of them being the depth of the tree, okay). At each node you find the best possible split of the data based on an attribute or feature, okay. So you take every feature, select a threshold based on that feature, and calculate the information gain between the parent node and the weighted average of the child nodes.

Whichever feature and threshold give you the highest information gain, or the highest change in whatever criterion you use, you do the split based on that feature and that threshold, okay. Once you have that split, you add 2 leaf nodes, assign the split data to each of them, and then from each one of those leaf nodes you again start doing the same thing, okay.

So you keep doing that till you reach a particular depth, or, if all the elements in a leaf node correspond to a particular class, you stop at that point, so you keep going till then, right? And at test time, whenever test data comes in, you just pass it through the tree: you start at the root node and, based on the criterion used at each particular node, you keep passing the data point to successive child nodes till you get to a last leaf node, which tells you the class, okay.

Now it is possible that some of the leaf nodes might not be pure, okay. So you might have these impure leaf nodes, right? In which case you just do majority voting: if one of the leaf nodes has, lets say, 8 Yes and 1 No, then whenever a test data point ends up in that particular node you will say Yes, okay.

So thats a possibility; thats what we have to do. This is a brief on how to use decision trees, binary decision trees specifically, for a classification problem. However, we can also use decision trees for regression problems, right? And we will see how we can go about doing that. There is one more aspect of these binary decision trees: they tend to overfit.

So if you keep growing the tree, adding leaf or child nodes to every node that you have till you get maximum purity, in the sense that you get to leaf nodes which have only one class, then what has happened is that you have grown a tree which fits your training data exactly, okay, and it might not generalize very well to the test data, okay. So thats a problem with trees, because you can always build a tree to overfit your data if you give it enough depth, okay.

So to prevent that, there are methods for what is called pruning trees. Once you have finished growing a tree using the algorithm, you go back and see if you could have stopped earlier, and then you throw away some of the leaf nodes, so you cut down on the height of the tree. That is a possibility as far as binary decision trees in general are concerned, okay.
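As a concrete illustration, here is a minimal scikit-learn sketch (the data set is synthetic, and ccp_alpha selects scikit-learn's cost-complexity pruning, one particular pruning scheme):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# a fully grown tree (no depth limit) tends to fit the training data exactly
deep = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X_tr, y_tr)
# pruning trades a little training accuracy for better generalization
pruned = DecisionTreeClassifier(criterion="entropy", ccp_alpha=0.01,
                                random_state=0).fit(X_tr, y_tr)
print(deep.score(X_tr, y_tr), deep.score(X_te, y_te))
print(pruned.score(X_tr, y_tr), pruned.score(X_te, y_te))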

So in the next videos we will look at binary regression trees as well and see how they work. We will also give a brief mathematical motivation for how we define a cost function for these binary decision trees; that is something we would also like to look at. And once we are done with the regression tree as well as the cost function for binary decision trees, we will see what other ways there are to address the overfitting problem.

So we will specifically look at bagging and boosting, and we will also look at random forests, okay. Thank you.

Machine Learning for Engineering and Science Applications
Doctor Ganapathy Krishnamurthi
Department of Engineering Design
Indian Institute of Technology, Madras
Binary Regression Trees

Hello and welcome back. In this video we will look at binary regression trees. The material in this video is inspired by the textbook The Elements of Statistical Learning by Hastie, Tibshirani and Friedman.
(Refer Slide Time: 0:42)

So binary trees try to solve the regression problem by dividing your input feature space recursively into regions and assigning a constant value to the output if the input features fall into that particular region, okay. So lets consider this example where we have a continuous response Y and an input vector X with 2 dimensions; in this case we will refer to them as $X_1$ and $X_2$.

And of course the task is to predict Y given test inputs $X_1$ and $X_2$. So assume that you are given training data; we will see how recursive binary splitting of the input feature space gives you the desired result, okay. So lets start with the feature space $X_1$, $X_2$; there are 2 dimensions, right? We recursively split by choosing a threshold along $X_1$; we will call that $t_1$, okay.

So now, once we choose this threshold $t_1$, we see that the input feature space is divided into 2 regions, one to the left of the blue line and one to the right. We can do a further split: we take the region on the left and split it by choosing a threshold along $X_2$, call that $t_2$, right? And we go to the right again, where we split by considering a threshold $t_3$ along $X_1$.

Then, once we choose $t_3$, this region is split into 2, so we take the region on the right and once again split it into 2 by choosing another threshold $t_4$, okay. So if we go on like this, we will get regions $R_1$, $R_2$, $R_3$, $R_4$ and $R_5$, right? What I have not specified so far is how we determine when to stop, right? How far do we go? How often do we keep splitting?

So every time we end up with 2 regions after each split, and each of those regions we can split further into 2, and we keep doing that till a specified point, or till a specified criterion is met, okay. Now, once we have these regions, how do we determine the output for new test data?

Lets say we are given some test values $\tilde{X}_1, \tilde{X}_2$ (the tildes are there to distinguish them from the training data). If $(\tilde{X}_1, \tilde{X}_2)$ falls in region $R_1$ (you can plot it: this axis is $X_1$, this is $X_2$, and this is the region $R_1$ into which it falls), then how do you determine what Y would be, okay? Thats what we are going to look at now.

(Refer Slide Time: 3:39)

This is the model that determines Y. What this formula says is: you consider the region into which $(X_1, X_2)$ falls, okay. So lets say it belongs to region 5, right? Here $I$ is an indicator function, so it returns 1 for the region into which $(X_1, X_2)$ falls, and the model just assigns a constant value $c_m$, in this case $c_5$, and that constant value is returned as the output; in symbols, $f(X) = \sum_{m=1}^{M} c_m\, I((X_1, X_2) \in R_m)$.

So that will be the same output for all $(X_1, X_2)$ that fall into that particular region, for the test data as well as the training data, okay. So lets see how we can fit this into a tree-like structure, so that it becomes more obvious. We start with all the data points, capital N data points lets say, and we consider the first threshold at the top node here; each of the circles is a node.

So we start with the top node, lets say $X_1 \le t_1$, so this is the threshold $t_1$ here, okay. It splits into 2, like we saw in the previous slide, and on the left-hand side we look at $X_2 \le t_2$, giving rise to $R_1$ and $R_2$. So this will be $R_1$, $R_2$, right. Okay, so then we take the region on the right-hand side.

So for $X_1 > t_1$, that region we split again into 2 by considering $X_1 \le t_3$, so we get $R_3$ here, which corresponds to this region, okay. That gives you 2 regions, one to the left of $t_3$ and one to the right of $t_3$, and the right region we once again split by considering $X_2 \le t_4$; this is $t_4$, okay. So this will be $R_4$ and $R_5$.

Lets be consistent with how we label them: if the condition is true, it is $R_1$, and if it is greater, it is $R_2$, which is correct. So thats the order in which we have divided the input feature space into regions. So now we have seen how to fit this recursive splitting into a binary tree; the next step is to see how to actually grow this tree.

So what I mean by growing this tree is determining how we choose the features to split on. In this case I just chose to split on $X_1$ in the beginning, but what is to stop you from using $X_2$, lets say, okay? So how do you determine which feature to choose, and in fact the threshold also: why split based on $t_1$, why is $t_1$ so special? The 2nd problem to solve, of course, is to figure out what these $c_m$ are.

The $c_m$, with subscript m, are just some constants that we assign to any input features that land in that particular region $R_m$. So how do we determine $c_m$, and how do we determine the number of partitions, okay? That is the problem that the tree-growing algorithm solves, okay.

(Refer Slide Time: 7:34)

So to grow a regression tree, we consider this problem where we have N data points $(x_i, y_i)$. The notation here differs slightly from the previous slide, just to make it simpler, but in this case you have to consider the full problem: you have N data points $(x_i, y_i)$, with each $x_i$ being p-dimensional. What I mean is that $y_i$ is the output corresponding to input $x_i$, but each $x_i$ is p-dimensional: lets say $y_1$ is the output corresponding to $x_1$, and $x_1$ itself has components $x_{11}, x_{12}$, and so on up to $x_{1p}$, okay. So the input vector is p-dimensional; in the previous case we were looking at 2 dimensions, right, and here it is the more general p-dimensional case. So the tree-growing algorithm determines the split variable: which of these should we split on?

So if you write this down, $x_{i1}, x_{i2}$ up to $x_{ip}$, right? We have to determine which of these features to choose, is it the 1st, 2nd or 3rd, or the pth feature, and what is the split point itself for that feature? In the context of the previous slide: what should be the values of the t's that we use as thresholds, right? And the basic model is to model the response, that is the output y, as a constant in each region.

So we determine which region X falls into, and we have a constant which describes the output in that region; we assign that as the output, okay. And we can actually formulate this as a least squares problem. If we know the number of partitions M beforehand, then we can pose the least squares problem as

$L = \sum_i \Big[ Y_i - \sum_{m=1}^{M} c_m\, I\big((X_1, X_2) \in R_m\big) \Big]^2$

So L is your cost function, right? We are looking at basically $[Y_i - f(x_i)]^2$, where $f(x) = \sum_{m=1}^{M} c_m I((X_1, X_2) \in R_m)$. So here, if we know the number of partitions M, alright, we know that we are going to split the input region into M regions and we actually know what the M regions are.

Then the $c_m$'s are easily determined by solving this optimization problem. If you take the derivative with respect to the unknowns, in this case the $c_m$, then it is easily shown that each $c_m$ is nothing but the average value of $Y_i$ in its region, okay. So we know the output corresponding to every $X_i$ that falls into a particular region $R_m$, and in the training data we take the mean of those outputs; that will be the output assigned, okay. So if we know the number of partitions beforehand and we pose this as a least squares problem, then the response at each of the partitions is nothing but the average of the responses of the training data falling into that particular region, okay. So thats the solution.
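Setting the derivative of L with respect to a single $c_m$ to zero makes this explicit (standard least squares algebra, not shown on the slide):

$\frac{\partial L}{\partial c_m} = -2\sum_{x_i \in R_m}(y_i - c_m) = 0 \;\Rightarrow\; \hat{c}_m = \mathrm{ave}(y_i \mid x_i \in R_m)$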

But now the problem is that determining the partitions themselves beforehand is a computationally very difficult problem to solve, because you can see that there are very many combinations of regions possible when optimizing this particular least squares cost function. So a greedy strategy is adopted.

(Refer Slide Time: 11:49)

So how is it solved? We start off by solving the problem where the recursive binary splitting comes into play. We choose a particular feature and determine the best split point for that feature, okay. Which feature we choose and what the best split point is are determined by solving this optimization problem, okay.

So, for the sake of clarity, lets say we have chosen feature j and its split point is s; if it is a continuous variable, you can think of s as a threshold, like the $t_1, t_2, t_3$ and $t_4$ we had in the previous slides, okay. So j is your feature index and s is your split point or threshold for that particular feature. If we split the entire data space based on that particular feature, then we get two regions, $R_1$ and $R_2$, okay.
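For reference, the optimization problem being referred to, in the form given in The Elements of Statistical Learning, is

$\min_{j,\,s}\Big[\min_{c_1}\sum_{x_i \in R_1(j,s)}(y_i - c_1)^2 \;+\; \min_{c_2}\sum_{x_i \in R_2(j,s)}(y_i - c_2)^2\Big]$

where $R_1(j,s) = \{x \mid x_j \le s\}$ and $R_2(j,s) = \{x \mid x_j > s\}$.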

So then we can write down the loss function for the optimization problem. It has inner and outer optimization loops: the outer one determines the best j and the best s for that j, and the inner optimization is actually trying to figure out what $c_1$ and $c_2$ are. In the sense that, at every node of the tree that we saw earlier, we have all the input training data, we decide that there are only going to be two partitions, and what we have to estimate is the best feature and the best split point for that partition, right, by optimizing this loss function. Thats all we need to do. And it turns out that $c_1$ and $c_2$, like we saw earlier, are nothing but the average outcome of the points that fall into $R_1$ and the average outcome of the points that fall into $R_2$; thats the solution for a particular split, okay.

So it turns out that we can determine j, since the inner problem is easily solved: we can scan through all the features, find the best split point and the lowest cost for each feature, and choose the one which gives the least cost function.
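A minimal sketch of this inner/outer search for a single split (my own illustration, not the lecture's code):

import numpy as np

def best_split(X, y):
    # greedy regression split: scan every feature j and threshold s,
    # minimizing the summed squared error with per-region means c1, c2
    best = (None, None, np.inf)                 # (j, s, loss)
    for j in range(X.shape[1]):
        for s in np.unique(X[:, j])[:-1]:       # keep both regions non-empty
            left, right = y[X[:, j] <= s], y[X[:, j] > s]
            # inner problem: the optimal c1, c2 are just the region means
            loss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if loss < best[2]:
                best = (j, s, loss)
    return best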

So once we split at a particular node, we are left with two regions, and then we adopt the same procedure: we go to each one of the regions and split them again into two, okay. The next question is, where do we stop splitting? Can we just keep going on? Typically, the stopping criterion is not very well defined.

So usually the tree is grown to a predefined depth, and then there is a procedure called pruning, which helps bring down the depth of the tree. It is easy to see that as you increase the depth of the tree, it is easy to overfit, right? Because you can always split your input space into smaller and smaller regions, so your training data will be fit perfectly, but of course on your test data there will be a lot of error in your response.

So in the next videos we will look at how binary trees are used for classification problems, thank you.

Machine Learning for Engineering and Science Applications
Doctor Ganapathy Krishnamurthi
Department of Engineering Design
Indian Institute of Technology, Madras
Bagging
(Refer Slide Time: 00:13)

Hello and welcome back. In this video we will look at bagging, continuing from our lecture on decision trees. All the figures in this presentation are provided by Intel software, and the material is inspired by Introduction to Statistical Learning by Gareth James.

(Refer Slide Time: 00:32)

So, based on what we saw last time about decision trees: they have a tendency to overfit the data, that is, they perform very well on training data, but when new test data arrives the errors are huge. So basically it is a high variance model; that is another way of looking at it. One solution to prevent overfitting in decision trees was to prune the trees, but that helps reduce variance only up to a point; beyond that it does not really help and the effects are not very significant, ok.

So bagging is a procedure that was developed to reduce the high variance of decision trees.

(Refer Slide Time: 01:25)

We will look at how it does that. The idea behind bagging is to train a lot of decision trees on a given data set, ok: we train a multitude of decision trees and combine their predictions to reduce the variance, right. This is based on a very simple fact: let us say you have independent data points $z_1, \ldots, z_n$, each with variance $\sigma^2$. If you look at their mean, $\bar{z} = \frac{z_1 + \cdots + z_n}{n}$, the variance of that mean is $\frac{\sigma^2}{n}$, right.

So as you increase the number of data points, the variance of the mean comes down; that is the concept exploited by bagging. You train a lot of trees on a given data set and you take the average of their predictions if you are training regression trees, or you combine the predictions for classification in some way or the other.

(Refer Slide Time: 02:37)

So where do we get the data sets for training these multiple trees, right, since we have only one data set? Ideally what we would like to do is to have multiple measurements: by measurement I mean you collect data 100 times, and with every data set you collect, you train a decision tree. However, that is typically not possible, because you are generally given only one set of data, and you have to grow your trees based on that.

So the way to train multiple trees is to grow decision trees from bootstrapped samples, ok. What do we mean by bootstrapping samples? That is what we are going to look at. There is this data set about movies: their budget, total gross, who the director of the movie is, the rating, and so on and so forth. We are trying to make some decision based on this data; we will not go into the details, it is just to illustrate the data set.

So the way we go about doing bootstrapping is to select a subset of this data with replacement. If you look at this particular figure, the blue area is the sampled data; the idea behind bootstrapping is to sample your given data with replacement to create a new data set. So in this particular movie database there are 17 data points, and we are selecting a subset of this data, marked in blue, with replacement, and we do it, let us say, capital B number of times.

So B bootstrap samples are generated, and we use each one of these B bootstrap samples to train a decision tree, and combine the outputs of the decision trees, say as an average or with a voting scheme, to get the desired result. If you look at this particular realization, we have a different subset marked in blue here; this other realization samples a different portion of the data, and likewise in this case, another sample of the data.

So you can sample different subsets of the data with replacement; with each subset you train a decision tree, and you average the outputs for regression or use some voting scheme for classification.

(Refer Slide Time: 05:26)

So if we do that, what is the nature of the data sets we get with this kind of sampling? In this plot, the horizontal axis is the size n of the bootstrap sample (each time you select 10, 20, 40, 80 or 100 samples from your data), and the vertical axis is the probability of a particular data point not being present in the sample, ok. That is given by the expression $(1 - \frac{1}{n})^n$, which approaches $e^{-1} \approx 0.37$ as n grows.

So beyond a point, as the sample size increases, we see that typically every bootstrap sample contains about two thirds of your original data points, and about one third of your original data is not present in the sample, on average; that is the makeup of your sample data. This is possible because we are sampling with replacement. So just to reiterate: as you increase the sample size, the fraction of data points left out approaches about one third, so one third is not selected and your bootstrap samples are made up of about two thirds of the original data points, right.

We will exploit this fact to do error estimation later on, but first we will see how we can go about using these bootstrap samples for training decision trees and obtaining an output.

(Refer Slide Time: 07:17)

So, let us say in this illustration we have multiple trees, ok, and each of those trees is trained on one of the bootstrap data sets, and there is an output corresponding to each of those trees. Let us say this is a classification task, a binary classification, so we are looking at either red or blue, right; each of these trees outputs a decision for a particular data point.

So two of the trees said red and one tree said blue, ok. For a classification task you can just do maximum voting: since red gets the maximum number of votes, you would classify that particular data point as red, ok. Similarly we can do this for all the data points: each of these columns is a data point, and for every data point each of the trees will output a certain category, red or blue, and we assign the category which gets the maximum number of votes.

So as you go along the columns, for each column you select the category which got the maximum number of votes, ok. This is for a classification task, so in the end you get a single classifier, ok. For regression, you instead just take the mean, the average, ok. There are many ways of going about this. If you think of a classification task, then based on the outputs of the individual decision trees you can form output probabilities: let us say there are three classes; then you can take the probability of class 1, class 2 and class 3 each as the fraction of the trees which gave that class as output, right.

So $p_1$ is the fraction of trees that gave class 1 as output, and so on. You can do it this way, but a better way is to get the raw probabilities from the output of each tree itself: each tree outputs a certain class with a probability that you can calculate, ok, based on the leaf to which the data point is assigned, and you can take the average of those probabilities over all the trees and use that to classify into a particular class, ok. So even though the fraction-of-trees approach is appealing, the better approach is to directly estimate the probability from the output of each of the individual trees and then average those probabilities instead.

In this case we are considering the fraction of trees that give a particular class as output, ok. So, in the end, what are we doing? We are bootstrapping our data set, building a decision tree with each bootstrap sample, and aggregating the outputs of those decision trees. So bagging is nothing but bootstrap aggregating, ok. That is the idea behind bagging, and what it improves is the high variance: it brings down the variance in your model, because decision trees tend to overfit.

So having a multitude of decision trees trained on similar data will give you a better average output.
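To make the bootstrap-and-vote idea concrete, here is a minimal from-scratch sketch (all names are my own; it bags scikit-learn decision trees and takes a majority vote, assuming 0/1 labels):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagged_fit_predict(X, y, X_test, B=100, seed=0):
    # train B trees on bootstrap resamples of (X, y); majority-vote on X_test
    rng = np.random.default_rng(seed)
    n = len(X)
    votes = np.zeros((B, len(X_test)))
    for b in range(B):
        idx = rng.integers(0, n, size=n)     # sample n points WITH replacement
        tree = DecisionTreeClassifier().fit(X[idx], y[idx])
        votes[b] = tree.predict(X_test)
    # majority vote across the B trees (for regression: return the average)
    return (votes.mean(axis=0) > 0.5).astype(int)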

(Refer Slide Time: 11:25)

So one of the ways of calculating the error, or validating your machine learning algorithm, as we have seen earlier, is cross validation: k-fold or n-fold cross validation. As you can imagine, that is kind of difficult to do with the bagging approach. So what is the typical approach taken when you are using bagging, right?

We saw that about one third of the data samples are left out, on average, in every bootstrapped version of your data set. So once you have created a tree based on a subset of the data, you can measure the error on the unused samples. So let us say you have about 100 trees, ok, and let us say 30 of those trees do not use some data point; I will call it $x_{40}$ or something. So let us say you have 120 data points, $x_{40}$ is one data point, and 30 of the trees do not use it. Then you will evaluate the error on $x_{40}$ on each of those 30 trees and average them, ok.

So, for every data point, you can take the subset of trees that do not use that data point, use those trees to evaluate the result for that data point, and use that average as a measure of the error of your algorithm, right. This can be the root mean square error if you are doing a regression task, or the classification error in the case of a classification task, ok.

This procedure is called out-of-bag error estimation. This is one of the advantages of using bagging: you do not have to specifically do k-fold cross validation; you just have to identify the data points that were not used by subsets of the decision trees in the bagging procedure, and you evaluate the error using only those trees on those particular data points.

Like that, you can identify, on average, almost one third of your data set as not being used by one tree or another. So you can accumulate the error over all those data points and calculate an average, ok; that gives you the error estimate, either for classification or for regression.
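A minimal scikit-learn sketch of this (synthetic data; oob_score=True asks the ensemble to score each training point using only the trees that did not see it):

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                        oob_score=True, random_state=0).fit(X, y)
print(bag.oob_score_)   # accuracy estimated from out-of-bag predictions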

(Refer Slide Time: 14:06)

Similarly, you can do the same kind of procedure for feature importance, right. Feature importance can be measured using the classification error or, more typically, the Gini index. What do I mean by that? When we do the split at every node in a tree, we choose features based on the largest change in the Gini index, or the entropy, or the classification error.

So it is the same procedure as before, but instead of accumulating the error, we accumulate the Gini index change and average that over all the trees, ok. That will give you a good idea of which feature is the most important. Recall that this is much easier to do if you have a single decision tree, because at every node, whenever there is a split, you know the change in the Gini index, or the entropy, or whatever criterion you used to make the split, and you can use that as a measure of feature importance, ok. But when you are doing bagging, you are averaging over a multitude of trees, maybe several hundred of them, and then it is very difficult to do that in a straightforward way.

So because the data set keeps changing between trees, typically, we just have to do a procedure similar to what we did for calculating the output of the ensemble: instead of averaging the root mean square error or the classification error, you look at the Gini index change for every feature and average it over the trees, ok.

So you can recall that when we were talking about decision trees, we also looked at something called feature importance: we figured out that the feature most pertinent to the task at hand was the one that gave the biggest information gain, as measured with the Gini index for instance, ok. That was our way of figuring out the most significant features in our data, right.

When it comes to bagging, it is not one decision tree we are looking at but maybe 100 trees, and then it is very difficult to figure out which is the most important feature, right. So here again we do something similar to what we did earlier for calculating the out-of-bag error: we look at the average information gain over all the trees for that particular feature, and that gives you an idea of whether that feature is important or not, ok.

So, during the process of training, like we saw, we take bootstrap samples and train every tree with them, and whenever we make a split on the data based on the information gain, we keep track of the information gain for that particular feature across all the trees, whenever that feature is used. We take the average of that, and that gives you the feature importance when you are using bagging for classification, or even for regression, right.

(Refer Slide Time: 18:04)

So typically, bagging performance increases with the number of trees, ok. It is stated in many textbooks and resources that about 100 trees should do it, ok: with about 100 trees you should be able to get a good average and reduce the variance significantly, ok. And of course you can always measure the error using the out-of-bag error, and you can also look at feature importance by averaging across the trees.
(Refer Slide Time: 18:37)

So the advantages of bagging are the same as for decision trees: they are easy to interpret and implement, ok. You do not have to do any extra pre-processing especially for bagging, right: whatever pre-processing you do to the data, centering it at zero and scaling to unit variance, remains the same, so there is no extra data pre-processing or anything of that sort. And of course doing bagging improves the variance, so there is less variance in your output, and training multiple trees can be done in parallel, because they are all trained independently; there is no correlation in their training.

So in that way the training of one tree does not depend on the training of another, and you can train multiple trees simultaneously; of course you have to implement them in a suitable programming construct so that this can be done very quickly, right. So bagging is a very useful tool to have, especially if you are using decision trees, and many of the Python packages come with it: you can try scikit-learn or any other Python module that offers bagging built in.

So we have looked at decision trees, and we have now also looked at bagging, which basically exploits the fact that if you average over a large number of trees then you get lower variance. But a problem still exists, because we are sampling the data with replacement and the process for growing the trees remains the same. So consider a situation where a certain feature, let us say the director of the movie in this data set, has a significant impact on whatever classification you are trying to do, and is the most important feature.

Then, even though you are selecting bootstrap samples of the data, the first split is always going to be on the director, right. That way, all the trees that you train are correlated, and if you average over a bunch of correlated variables, you do not get such a huge reduction in variance, ok, so that is not great. So if a data set is like that, wherein a certain feature has a huge impact on the classification output, then bagging does not really help.

So to alleviate that problem, we will look at what are called random forests, ok. In random forests, you do the same thing: you train a lot of trees, but then you not only select the data points at random, you also select the features at random, and that helps to weaken the correlation between the trees you train and improves your output, ok.

We will look at this in slightly more detail, and following random forests we will also look at gradient boosting and adaboost. These techniques are also generally very powerful for improving the accuracy of whatever classifier you might be using. We will look at these in the next few videos, thanks.

Machine Learning for Engineering and Science Applications
Doctor Ganapathy Krishnamurthi
Department of Engineering Design
Indian Institute of Technology, Madras
Random Forests

Hello and welcome back. In this video we will look at random forests, which are based on binary decision trees.
(Refer Slide Time: 00:22)

So we saw that bagging, or bootstrap aggregation, can be used to reduce the variance when you are using binary decision trees; this is done by training a bunch of decision trees using bootstrap samples of your training data, ok. Because binary decision trees tend to overfit, averaging over a very large number of them reduces the variance and improves generalization performance, ok.

However, the problem with bagging is that it is possible for the trees to be correlated, in the sense that if we have one or two very strong features, on which the best information gain is always obtained and which therefore always split first, then no matter how many trees you average over, they are correlated, so beyond a point bagging will not reduce the error in your predictions, ok.

To reiterate: if there are a few strong indicator features which lead to the maximum information gain, averaging correlated variables will not help in reducing the variance.

(Refer Slide Time: 01:36)

So, in order to deal with this problem, we do something called random forests. As the name implies, it is again a bunch of decision trees, but there is a difference in how we do it compared to bagging. With bagging, we grow decision trees from multiple bootstrap samples: here is our training data, and the blue highlighted area is what we are choosing as the training data out of the entire data set.

As we saw in the previous lectures, this is just the movie database: the date of the movie, the title, the budget, the domestic total gross, the director, the rating and the runtime are given, and we are choosing the subset highlighted in blue as our training data, ok. So with bagging, we just selected random samples with replacement from the training data.

That is, basically, we bootstrap the training data: from one set of training data we produce, let us say in this case, three sets of training data. As you see, the blue box keeps moving around inside this white table, which is the total data available to you, and with each bootstrapped training data sample you fit a binary tree. We saw in the last video (and will not go into the details here) how, for a given test data set, we run the test data through all the trees trained on the bootstrap data and just average for regression, or take a maximum vote for classification, ok.

(Refer Slide Time: 03:31)

For random forests, it is again the same principle: we grow decision trees from multiple bootstrap samples, with the exception that even the features are chosen at random. Let us say we look at this particular tree here, trained using the bootstrap data set highlighted in blue: we will not use all the features. At the root node I have cut out these two columns of features, ok, and when you come to this particular node here, again I have cut out some of the features; let us say I left out the title feature this time, and we grow the decision tree like that, ok.

So the difference between bagging and random forests, from a very top-level viewpoint, is that when we grow these multiple decision trees from bootstrap samples, not only do we bootstrap the training data set, but we also choose only a subset of the available features at every node of every tree, ok. What happens is that this reduces the correlation between the trees that we train, ok; that is the principle behind random forests.

(Refer Slide Time: 04:47)

So it is a modification to bagging wherein we grow decision trees from multiple bootstrap samples: let us say we have M data points, then each bootstrap sample has M data points, and for every tree, at every split, we consider only a random subset of the features. Let us say you have a total of N features; we consider D of them, and typically D is of the order of the square root of N, ok. So at every node in every decision tree you choose only a subset of the features.

Learning on the subset reduces the correlation between the trees, ok. The disadvantage of building a random forest this way, from trees trained on bootstrap samples as well as randomly chosen features, is that it becomes difficult to interpret, ok, because every node, even if you grow all the trees to a particular depth, might be split on a different variable.

So it will be very difficult for us to interpret which variable gave the best information gain and things of that sort, ok. However, there is a big improvement in performance compared to bagging, because we are decorrelating the trees we train, ok. So we typically train hundreds of trees and, as we saw in bagging, for a regression output we take the average of the outputs of all the trees, and for a classification output we do maximum voting, ok.
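A minimal scikit-learn sketch of this (synthetic data; max_features="sqrt" is the D of the order of the square root of N feature subsampling just described):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=16, random_state=0)
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                            oob_score=True, random_state=0).fit(X, y)
print(rf.oob_score_)            # out-of-bag accuracy estimate
print(rf.feature_importances_)  # impurity-decrease importances, averaged over trees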

So this is a brief note about random forests; the fundamental building block is again the decision tree, and how we train the trees and how we interpret the results are the only differences, thanks.

Machine Learning for Engineering and Science Applications
Doctor Ganapathy Krishnamurthi
Department of Engineering Design
Indian Institute of Technology, Madras
Boosting

(Refer Slide Time: 00:13)

Hello and welcome back. In this video we will look at boosting, which is another technique for improving the performance of decision trees. Most of the slides, graphics and illustrations are provided by Intel software, and the content is inspired by the textbook Elements of Statistical Learning, ok.
(Refer Slide Time: 00:32)

So we saw in the last lecture that bagging improves performance primarily by reducing the variance, and that is accomplished by training multiple decision trees over bootstrap samples of your data, ok. The output is basically the average over all the trees when you are looking at a regression problem, or a majority vote when you are looking at a classification problem.

In this lecture we will look at boosting, which is an alternative to bagging and also improves prediction accuracy, ok.

(Refer Slide Time: 01:20)

So we will take a brief overview of boosting, just to get an understanding of the algorithm. One of the classifiers used in boosting is a decision stump: it is basically a classification tree with one node, and it splits the data space into two. These decision stumps are referred to as base learners, and for boosting or bagging we can use these stumps or more complicated decision trees.

As I mentioned, the building block for boosting is primarily a decision stump, and it has one node. So if you look at a particular data set which has temperature as one of its features, we can split the feature space into two at this node based on a threshold on the temperature; this is just an illustration of how the decision space is split into two, and this is your classification boundary.

So all the data points to the right of the classification boundary belong to one node and those to the left to the other, ok. We will use a collection of these to perform boosting; let us see how it is done, ok.

(Refer Slide Time: 02:50)

So we create an initial decision stump based on one of the features to split the data space into two, ok. Just for the sake of illustration, here we are showing a 2D plot; of course, we know that the input data can have multiple features, so this is a visualization of a couple of features, ok. Based on the split by the decision stump we have two classes: again, we are looking at a binary classification problem where the output is either minus 1 or 1, ok.

So we have the decision boundary, and we have correct classification of the red data points to the left and correct classification of the blue data points to the right. These red crosses on the right are misclassifications. So what we do is adjust the weights of those points: we assign a higher data weighting to those points when we calculate the loss function for the next decision stump, and we do a further classification after assigning the weights.

Think of assigning weights this way: if you take a simple least squares cost function, you assign a higher cost to these misclassified data points compared to the others. In fact, you can even suppress, or make minimal, the contribution of the data points that have been classified accurately, and assign higher weights, in the sense of a higher misclassification loss, to the data points that were misclassified by the previous decision stumps.

So then we have these re-weighted data points, which we classify again to get a new classification boundary. In this case, again, all the red points have been classified correctly, while there are a few misclassifications among the blue data points, and then one more time we do the re-weighting and get a new classification boundary here. So the output of boosting is basically the sum of all these classifiers. We call them $G_1(x), G_2(x), G_3(x)$, ok; each classifier provides a classification boundary, and based on that boundary it classifies points as $-1$ or $1$; the points that are misclassified are accorded higher weights for the successive classifier.

Most of the time these are decision stumps, as we saw earlier, right. So the next classifier, $G_2$, is a decision stump which takes the same data as input, but the data points misclassified by the previous stump are given a higher weighting. Finally, the classifier that you use in the end is basically the summation of all the individual classifiers, ok.

(Refer Slide Time: 06:15)

And in order to improve this, since these models are prone to overfit, each term typically has a weighting factor, call it $\alpha$; it is different for each classifier, so I will call them $\alpha_1, \alpha_2$, and so on, and together they give your final decision boundary, ok. So the result of boosting is a weighted sum of all the classifiers, and successive classifiers are weighted according to a scheme which we will see in a later slide, ok.

(Refer Slide Time: 06:52)

The Adaboost (adaptive boosting) algorithm is a very popular boosting algorithm. We look at it in the context of a two-class problem where the output of the classifier is $-1$ or $1$, so it is a binary classifier, and the resulting classifier is the weighted sum of individual classifiers: we have capital M classifiers, each of which can be a decision stump, and the sign of their weighted sum is the output, $G(x) = \mathrm{sign}\big(\sum_{m=1}^{M}\alpha_m G_m(x)\big)$, ok. That is the model we are fitting.

So how does the Adaboost algorithm proceed? First, initialize the data weights to $\frac{1}{N}$: when we start out, we assign the same weight $\frac{1}{N}$, where N is the total number of data points, to each individual data point $x_i$, ok. Then, for each of the capital M classifiers, in each iteration you fit a classifier to the data $x_i$ using the weights $w_i$.

In the first iteration the weights are just $\frac{1}{N}$, and they are all the same, ok. Then, after the classification, we compute the error term, which is given by this expression:

$\mathrm{err}_m = \dfrac{\sum_{i=1}^{N} w_i\, I(y_i \neq G_m(x_i))}{\sum_{i=1}^{N} w_i}$

So this is nothing but a weighted error term. If you look at the numerator, it is a weighted count of the samples that have been misclassified: $G_m(x_i)$ is the output of the mth classifier, $y_i$ is the ground truth, and $I$ is the indicator function, so we are counting, with weights, the samples $x_i$ that have been misclassified by $G_m$, ok.

So the error we calculate for the classifier in the mth iteration is the weighted error, based on the weights, which in the first iteration are all $\frac{1}{N}$, ok. How do we update the weights? We also have to compute the $\alpha_m$, the weightings for the individual classifiers; $\alpha_m$ is calculated using the expression on the slide (in Adaboost this is $\alpha_m = \log\frac{1 - \mathrm{err}_m}{\mathrm{err}_m}$), ok, and once we have calculated $\alpha_m$ we can go ahead and update the weights.

The weights for the next iteration are given by the weights from the previous iteration times an exponential factor: $w_i \leftarrow w_i \exp[\alpha_m I(y_i \neq G_m(x_i))]$, ok. This loop is performed M times, and in the end we output the classifier as the (sign of the) weighted sum of the individual classifiers, ok. So this is just the algorithmic summary of what we saw earlier in illustrative form: in every iteration you fit your data with what is called a weak learner or a weak classifier, ok, you look at all the data points that have been misclassified, and you assign higher weights to those data points; that is what this step does.

So you assign higher weights to those data points, and then you take all the data points, including the ones that were misclassified but now with higher weights, you fit that data again using another classification tree or other classifier, and you repeat that as many times as the number of classifiers you want to use, ok. Since this M is a free parameter, the model can easily overfit; that is one of the reasons why we calculate the weightings iteratively, so that the learning happens very slowly, ok.

So over a large number of classifiers, the weighted linear sum will give you good classification accuracy. The point to note is that the individual $G_m$ can be weak learners, in the sense that they perform only slightly better than random classification; but if you accumulate all of them over capital M iterations, you can get a very high performing classifier, all right.
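A compact sketch of this loop, with scikit-learn stumps as the weak learners (names are my own; labels are assumed to be in {-1, +1}):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, M=50):
    # AdaBoost with decision stumps; y must be in {-1, +1}
    N = len(y)
    w = np.full(N, 1.0 / N)                  # initialize weights to 1/N
    alphas, stumps = [], []
    for m in range(M):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)     # fit weak learner with weights
        miss = stump.predict(X) != y
        err = np.sum(w * miss) / np.sum(w)   # weighted error err_m
        err = np.clip(err, 1e-10, 1 - 1e-10) # guard the log below
        alpha = np.log((1 - err) / err)      # classifier weight alpha_m
        w = w * np.exp(alpha * miss)         # upweight misclassified points
        alphas.append(alpha)
        stumps.append(stump)
    return alphas, stumps

def adaboost_predict(alphas, stumps, X):
    # sign of the weighted sum of the individual classifiers
    agg = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
    return np.sign(agg)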

(Refer Slide Time: 11:16)

So where does this idea come from? It basically has its roots in additive modelling: we have a model where our classifier is nothing but a linear combination of multiple classifiers, and each $G_m$ can be a decision stump, as we saw earlier, or a regular classification tree, ok. Now the problem is, when we write down a cost function, the individual classifiers have their own parameters, and we also have to estimate these $\beta_m$, ok; so how do we go about doing that? That is accomplished using forward stagewise additive modelling, so let us have a look at that.

So we start off with a function initialized to 0, and then for M steps, as many as the classifiers we have, we optimize the parameters of the mth basis function, or classifier. The reason these are called basis functions is the following. You may all be familiar with the linear regression class that you have been through, where one of the models we have seen is $Y = w_0 + w_1 x + w_2 x^2 + \cdots + w_M x^M$.

So we can consider each of these polynomials, $x, x^2, \ldots, x^M$, as basis functions. We can write our boosted model in the same form; here the basis functions correspond to the $G_m(x)$, ok. So we can think of each of these classifiers as a basis function, and we are trying to iteratively estimate the parameters corresponding to each of them and also the weighting we have to assign.

In this simple polynomial case the basis functions are just powers of x; instead of x we could use, for instance, basis spline functions, which would have multiple parameters to determine, ok. Now, when we put all of this into one cost function it becomes very difficult to optimize, so what we do is: in every iteration we estimate only that particular $G_m$; in the mth iteration we estimate only $G_m(x)$, ok, without changing the parameters of the first $m - 1$ classifiers that we have already built, ok.

So in every iteration we optimize the parameters of the mth basis function, which is given by this formula:

$(\beta_m, \gamma_m) = \underset{\beta,\,\gamma}{\arg\min}\; \sum_{i=1}^{N} L\big(y_i,\; G_{m-1}(x_i) + \beta\, G(x_i; \gamma)\big)$

Here L is the loss function for one data point, summed over the N data points; it takes as input your ground truth $y_i$, your classifier from the previous iteration, and the classifier that we want to estimate, ok.

So the β and G are what we have to figure out in this iteration, while we make no changes to the parameters of the classifiers and the β's from the previous iterations. Once we have done that, we update the classifier by adding the new term with its appropriate weight; this is the basis for the Adaboost algorithm that we have seen. We have not specified exactly what the loss function is, so let us look at a very simple one: the least squares cost function $L = [y - G(x)]^2$. Writing it for one data point $x_i$ (the total is just the sum over all of them), and substituting $G_{m-1}(x_i) + \beta G(x_i)$ for the classifier, it becomes $[y_i - G_{m-1}(x_i) - \beta G(x_i)]^2$.

So we have the classifier developed up to the $(m-1)$th iteration, right, and I have just substituted this expression here. This is the cost function we are optimizing if we use a least squares loss. Now you can easily see that $y_i - G_{m-1}(x_i)$ is nothing but the residual from the previous iteration, so we can just write this as $[r_i - \beta G(x_i)]^2$, ok.

So if you use a least-squares cost function, what you end up doing is fitting a regression algorithm, say a regression tree, to your data, and then calculating the residual: for every x_i you take the output and subtract it from the ground truth. In the next iteration you take those residuals and fit them against your input data.
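A minimal sketch of one such stage under the least-squares loss, assuming scikit-learn's DecisionTreeRegressor as the weak learner (the function name and tree depth are just illustrative): fit G to the residuals, then minimize $[r_i - \beta G(x_i)]^2$ over β, which has a closed form.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def one_stage(X, y, F_prev):
    # residuals r_i from the model built up to iteration m-1
    r = y - F_prev
    G = DecisionTreeRegressor(max_depth=2).fit(X, r)   # fit G to the residuals
    g = G.predict(X)
    beta = np.dot(r, g) / np.dot(g, g)   # argmin_beta sum_i (r_i - beta*g_i)^2
    return G, beta, F_prev + beta * g    # the updated additive model G_m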

So that is how it progresses if you use a least-squares cost function. If, instead of a least-squares cost function, you use an exponential cost function, what you end up with is exactly AdaBoost; that is what we will see now.

(Refer Slide Time: 17:29)

The AdaBoost loss function, or the algorithm that we saw for AdaBoost, comes from using an exponential loss function, with the basis functions (the weak learners, the individual classifiers) being decision trees or stumps. Previously we saw $L(y, G(x))$ (I am omitting the subscripts i) taken as $[y - G(x)]^2$; that is what we used as an example in the previous slide. Instead of that, we now have $\exp(-y\,G(x))$.

This product $y\,G(x)$, for a classification problem, is often referred to as the margin. You can easily see why: say your category is −1 and your classifier also outputs −1, then the product is greater than 0; say your category is 1 and your classifier also outputs 1, then it is again greater than 0. So for all positive values of this product you have a correct classification, and for all negative values a misclassification. That is the margin; you can also think of it as the distance from the decision boundary.

So we introduce this exponential function as the loss function in order to solve our additive model. Again, to recall, AdaBoost comes under additive modelling: we want to get to a classifier which is a sum of individual classifiers, and we want to optimize an appropriate cost function to estimate each one of them.

Now, if you wanted to do it the direct way, you would have one loss function that tries to estimate all of the parameters of the G_m and the β_m in one shot, but that becomes very complicated. Say M is 100 weak learners: then there are 100 sets of parameters, one for each of the G_m's. Here each G_m is a decision tree, so its parameters are basically the nodes and the features it splits on, and you would have to estimate those for every tree that you fit.

So, to alleviate this problem, we have the stagewise additive modelling, wherein we do M iterations: we start off with one decision tree (or a decision stump), estimate the parameters of that tree (how it is grown, the nodes and the features), and successively update. In the mth iteration we do not change the parameters of the first m − 1 classifiers but only update the parameters corresponding to the mth one.

So that is how stagewise modelling helps solve this problem. Ideally you would have a loss function which takes as input the y and your full model, and you would estimate all of the parameters of all the individual decision trees (the weak learners) in one shot, but that is a difficult problem to solve. That is why, as we saw, the loss takes as input the ground truth y and the additive model written as $G_{m-1}(x) + \beta G(x)$, and we do not touch any of the earlier parameters but only estimate those corresponding to the new term. That is the idea behind stagewise additive modelling.

So, if the loss function is an exponential loss, $\exp(-y\,G(x))$, then we end up with the AdaBoost algorithm. Let us take a brief look at the loss function itself; I am going to erase this so that you can clearly see the formulas involved. The problem we end up solving with the exponential loss is the argmin of this loss function. Since all of the terms sit in the exponent, and since, as we saw earlier, the parameters from previous iterations are not affected by this optimization, we will only be optimizing the parameters of G, that is, γ and β, at that particular iteration.

So we can replace the part involving the previous iterations by a weight; this is the loss function we end up optimizing for the AdaBoost algorithm. Again, among the update steps we saw when I outlined the algorithm, we calculate the error rate err_m, from which we derive an α (which is related to the β we see here); in addition we also do a weight update, where the weights are multiplied by some factor every iteration. All of these can be derived from first principles using these two expressions: we optimize for β_m and for γ_m, where γ_m stands for the parameters of your decision tree.

So you figure out the best decision tree by minimizing the (weighted) misclassification rate, and once you fix that, you can optimize for β by taking the derivative of this loss function with respect to β and setting it to zero. Plugging in the optimal decision tree, the update rule for the weights, the estimate for β, et cetera, can all be obtained by optimizing this loss function; even analytical expressions can be obtained. So this is the basis for the AdaBoost algorithm.
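To make those update steps concrete, here is a minimal sketch of discrete AdaBoost with decision stumps; scikit-learn's DecisionTreeClassifier with max_depth=1 is assumed as the weak learner, labels must be in {−1, +1}, and M is illustrative.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, M=50):                        # y in {-1, +1}
    n = len(y)
    w = np.full(n, 1.0 / n)                      # uniform initial weights
    stumps, alphas = [], []
    for m in range(M):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)         # weighted stump fit
        pred = stump.predict(X)
        err = np.sum(w * (pred != y)) / np.sum(w)    # weighted error rate
        err = np.clip(err, 1e-10, 1 - 1e-10)         # guard the logarithm
        alpha = 0.5 * np.log((1 - err) / err)        # the beta of the derivation
        w = w * np.exp(-alpha * y * pred)            # up-weight the mistakes
        w = w / np.sum(w)
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

The final classifier is the sign of the weighted sum, sign(sum over m of alpha_m * G_m(x)).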

(Refer Slide Time: 25:11)

So, just to understand what this algorithm actually does, look at the red curve here: it is the 0-1 loss. Say we have a binary classifier which outputs 1 or −1. What we would ideally like to do is assign a weight of 0 to all the correctly classified data points and a maximum weight of 1 to all the misclassified ones, and then fit another regression or classification tree, depending on the problem. But this kind of loss function is difficult to optimize, so it is replaced by the exponential loss function; that is what AdaBoost does.
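A quick numerical look at the two losses as functions of the margin makes the contrast clear (a tiny sketch):

import numpy as np

# The 0-1 loss is a hard step in the margin y*G(x); the exponential loss is
# a smooth surrogate that upper-bounds it and is easy to optimize.
margin = np.linspace(-2.0, 2.0, 9)          # margin = y * G(x)
zero_one = (margin < 0).astype(float)       # 1 if misclassified, else 0
exponential = np.exp(-margin)               # AdaBoost's surrogate loss
print(np.column_stack([margin, zero_one, exponential]))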

So the exponential loss function, as we saw, is $e^{-\text{margin}}$, which makes AdaBoost more sensitive to outliers than other types of boosting. The 0-1 loss, even though it is very intuitive, is hard to work with. To see what it would mean: say we have 100 data points; in the ideal scheme, if some 30 of them are classified correctly and some 70 are misclassified, then in the next step (going from G_1 to G_2) you would take those 70 and fit another classifier to them. But even though this looks very appealing, it does not give good results, because it is a difficult problem to optimize.

On the other hand, having an exponentially decreasing loss function instead of this step loss function helps improve the fitting. In the next class we will look at a further modification to this called gradient boosting. The procedure we outlined here works very well for the exponential loss function; gradient boosting techniques, however, are good for pretty much every kind of loss function that you can come up with.

So it is a very generalized procedure for doing boosting, and, according to many sources, it is also one of the more popular techniques for winning Kaggle competitions. We will look at it in the next video.

(Refer Slide Time: 27:51)

So, your loss function takes as input the ground truths and your model, just as we did before. Ideally we would estimate all the parameters of each one of these individual classifiers, as well as the weighting factors for each of them, in one shot; but when you are using, say, decision trees, this becomes a difficult problem to solve. So we solve it in a stagewise manner, which is why we saw earlier that we start with the model $L(y, G_{m-1}(x) + \beta_m G(x))$.

What this means is that the parameters built up so far are not touched at all: when optimizing this loss function, you optimize only for β_m and for this G at this particular stage m. We saw how this works for the least-squares cost function; if, instead of the least-squares cost function, you use an exponential loss, what does it look like?

So, L is given by this function, where the term $y\,G(x)$ is the margin. You can think of it as the (signed) distance of the data point from the decision boundary. For a binary classification task with classes −1 and 1, where G(x) is the output of the classifier, $y\,G(x)$ is always positive when your classifier is right and negative when it is not.

So, instead of the least-squares loss we saw earlier, we have put in the exponential loss function, and in the mth stage we do not modify the parameters of the classifier $G_{m-1}$ but focus only on estimating β and γ here. In the case of regression or classification trees, this γ represents the nodes and the features used in the splits. So in every stage you figure out one decision tree or stump and add it to the classifier that you have already estimated up to the (m − 1)th stage.

Since $G_{m-1}$ is not involved anywhere in the optimization, we can interpret that factor as a weight. If you recall the AdaBoost algorithm that we outlined, we had the quantity err, from which we calculated α_m as well as the update to the weights; this α and the β here are related (α turns out to be, I think, 2β). The estimates can be derived analytically: fix β and estimate G, which is nothing but the best classification tree you can get, obtained by minimizing the (weighted) misclassification rate; once you have figured that out, fix it, take the derivative of this expression with respect to β, and do the further analysis.

If you do that, you get all the update rules that we saw earlier. So that is the basis of the AdaBoost algorithm: it starts from this additive model right here, which looks like a linear basis-function model. The $G_m(x)$ are the basis functions; in this case we have used decision trees, but other structures are possible, and at every step we successively improve the classification or prediction accuracy.

Of course, so far we have only looked at binary classification problems, but this is also applicable to multiple classes. Another thing is that we have only looked at discrete AdaBoost, wherein we assume that our classifier returns −1 or 1; this can be modified for the case where the classifier returns probability values mapped between 0 and 1, corresponding to the classes −1 and 1. For the sake of illustration we only looked at binary classifiers.

Machine Learning for Engineering and Science Applications
Doctor Ganapathy Krishnamurthi
Department of Engineering Design
Indian Institute of Technology, Madras
Gradient Boosting

(Refer Slide Time: 00:14)

Hello and welcome back. In this video we will look at gradient boosting. We have seen that binary trees have high variance, and we have improved classification accuracy by bagging and by AdaBoost. Adaptive boosting, as we saw, improves classification by adaptively re-weighting the incorrectly classified points in every iteration. We will try to explain why it actually corresponds to a special case of what we are going to look at, namely gradient boosting with an exponential loss function. So in this lecture we look at this more general technique, gradient boosting, which can be used with any loss function, in the sense of any differentiable loss function.

(Refer Slide Time: 01:04)

So we will consider M training data points, each with N features. I have written classifier, but we are going to consider a regression problem here, using a binary regression tree. The output of the regression tree is ŷ, corresponding to the input x; I have not indicated a subscript for every data point.

So look at this table: across the iterations we are going to successively update our machine learning model F by learning from residuals. We consider the input features x_i, where i runs from 1 through n, and the corresponding target variable, the ground truth, is y_i. We fit this using a binary regression tree, which we denote by F_1(x_i): for every x_i the output is F_1(x_i), and F_1 is our model at this point.

So F_1(x_i) is our model at this point. Then we calculate the residual, which is the difference between the ground truth and the output of the regression tree, y_i − F_1(x_i). In the second iteration we once again take the input features of the training data, but we fit them not to y_i but to the residual; the model that we use to fit x_i to the residual we call h_1(x_i). At this iteration the updated model is F_2(x_i) = F_1(x_i) + h_1(x_i), where h_1(x_i) is the model fit to the residual, and we denote the combined model by F_2. Then we once again calculate the residual, and you can see how this goes: in the third iteration we again take the input features x_i, fit them to the residual of F_2, call that model h_2, and update our model to F_3 = F_2 + h_2.

So this is one form of gradient boosting, and we will see that it actually corresponds to using a least-squares loss function; without any subscripts, I am just going to write it as L = (y − F(x))². We will see in a few slides that this is the case. Also, to bring this in line with how it is usually treated in the literature: instead of training the data with a binary regression tree in the very first iteration, we initialize the model with the mean of the responses.

So F_0(x_i), for all x_i, is nothing but the mean of the responses in the training data, $\frac{1}{M}\sum_i y_i$; every prediction is initialized to that mean, which is just a guess. We call that model F_0(x_i), the zeroth iteration: we start off with the guess, calculate the residual based on it, and from the first iteration onward we fit the residual with a regression tree and update our model, once again calculate the residual, and so on. How do we generalize this? As I mentioned earlier, I have not given the entire derivation, but this is the special case where your loss function is the least-squares loss.
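The iterations in the table can be sketched directly, assuming scikit-learn regression trees as the models h_m (the number of iterations and the tree depth are illustrative):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost_least_squares(X, y, M=100):
    F = np.full(len(y), np.mean(y))          # F_0(x_i): mean of the responses
    trees = []
    for m in range(M):
        r = y - F                            # residual y_i - F_m(x_i)
        h = DecisionTreeRegressor(max_depth=3).fit(X, r)   # fit h to residuals
        F = F + h.predict(X)                 # F_{m+1} = F_m + h
        trees.append(h)
    return np.mean(y), trees                 # keep F_0 and the trees to predict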

(Refer Slide Time: 05:18)

So then, how do we generalize to any loss function? We have the input features and targets, and the initial model is the mean of all the responses. Then, instead of the raw residual, we calculate the gradient of the loss function with respect to the prediction from the previous iteration; that gradient is the new target. In the second iteration, as you see here, we update using the learned model h_1, which we have fit to the gradient of the loss function with respect to the previous prediction; and we now have a hyperparameter µ_1, for which we have to do one more optimization.

The extra optimization I am talking about is solving $\mu_m = \arg\min_{\mu} \sum_i L\big(y_i,\, F_{m-1}(x_i) + \mu\, h_m(x_i)\big)$. This is done using what is called a line search, a simple algorithm which I will not go through right now; just think of it as another optimization problem you have to solve in order to estimate this µ. The reason we have to do it, recall, is that we fit h_m to the gradient of the loss function with respect to the previous prediction, not to the residual itself, so you can think of µ as a correction that we apply to improve the prediction.
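The line search itself can be sketched with a one-dimensional optimizer; SciPy's minimize_scalar is assumed here, and loss stands for any per-point loss function:

import numpy as np
from scipy.optimize import minimize_scalar

def line_search(loss, y, F_prev, h_pred):
    # minimize sum_i L(y_i, F_{m-1}(x_i) + mu * h_m(x_i)) over the scalar mu
    objective = lambda mu: np.sum(loss(y, F_prev + mu * h_pred))
    return minimize_scalar(objective).x

# e.g. mu = line_search(lambda y, f: (y - f) ** 2, y, F_prev, h.predict(X))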

Once you have done this, you again calculate the gradient of the loss function with respect to the estimates you now have and move on to the next iteration. So here the features will again be x_i, and the target will be ∂L/∂F_1 (I am leaving out the arguments, but you can fill them in); then we fit h_2, update F_2 = F_1 + µ_2 h_2, and once again estimate µ_2 using a similar process. We keep doing this for M iterations.

M is determined by cross-validation: simply put, you have a held-out data set on which you test, and you find the iteration at which you get the best accuracy; that is one way of doing it. In general, cross-validation is the best method to estimate M. So why does this work? To give you some idea, recall gradient descent; you have all seen the gradient descent algorithm.

(Refer Slide Time: 08:13)

So let us consider our loss function, any loss function, with ground truth y and prediction F(x); x is our training data and F is the ML model, in this case a binary regression tree. If we look at, say, a least-squares loss function,

$L = \frac{1}{2M} \sum_{i=1}^{M} \big(y_i - F(x_i)\big)^2.$

To make it easier to write and also easier to comprehend, and to avoid too many subscripts, I am just going to write

$L = \frac{1}{2M} \sum_{i=1}^{M} (y_i - \hat{y}_i)^2.$

Now consider this loss function as a function of the predictions explicitly, so L depends on the ŷ_i. What we are trying to do in this regression problem is bring each ŷ_i as close to y_i as possible, so you can think of the ŷ_i as unknown parameters to be estimated. Given this loss function, if you use gradient descent to estimate the ŷ_i, the update rule is

$\hat{y}_i \leftarrow \hat{y}_i - \frac{\partial L}{\partial \hat{y}_i}.$

So there are i = 1 to M parameters; remember that the ŷ_i are nothing but F(x_i). If we do this as an iterative process, then after every iteration ŷ_i is updated, and we update it using ∂L/∂ŷ_i; and we estimate ∂L/∂ŷ_i by fitting it with a machine learning model h(x_i). This is what we do. You might ask: since L is differentiable, why not just calculate the gradient directly and keep updating? The problem with that is that our training data is finite; we do not have all possible x_i. The x_i are limited and do not cover the entire space of x.

Since the training data is limited, we only have a small number of points. If each x has, say, 20 features and we only have a thousand or ten thousand data points, think about it: we would not be spanning the entire space. Even in three dimensions it is easy to see that there would not be enough points. So, with a finite number of training points, we effectively regularize this gradient by fitting it with the machine learning model h(x_i); once we estimate the gradient that way, we update the parameters, which is exactly what we did. You can think of it this way: this is where gradient boosting comes from.

(Refer Slide Time: 12:04)

So, to summarize: we start with an initial guess, and at every iteration we update it by fitting h_m to the gradient of the loss function with respect to the prediction from the previous iteration, and by also figuring out the hyperparameter µ to get a better estimate. Remember, you are fitting to the gradient, so once you have fit the gradient with a machine learning model, you put a multiplicative factor in front of it and estimate that factor.

You can also think of it as taking a linear combination of multiple machine learning models; in most cases it is just a linear combination of multiple regression trees or decision trees. Now, it is also observed that if you do this, then very quickly, maybe in a few iterations, you will tend to start to overfit. So, in order to improve convergence and also not to overfit, a learning rate is typically introduced: this α is between 0 and 1 (something like 0.1 to 0.3), and you also have to set it in order to get better convergence. This is called shrinkage, or the learning rate: you still estimate µ, but you figure out α like a hyperparameter. This is the general process for gradient boosting.

(Refer Slide Time: 13:29)

So let us consider a non-least-squares loss, in this case the absolute deviation |y − F(x)|; I am going to leave out the summation to keep it clear. If you take the derivative of this loss with respect to F(x) (I leave that as an exercise), you get the sign of y − F(x). So in the second iteration you start fitting your machine learning model to just the sign: your features are the same input features and data points, and the target is either plus one or minus one. That is how you fit the machine learning model; then of course you update your model, and you can also use a learning rate to improve convergence.

If instead you consider [y − F(x)]² as your loss function, you can calculate ∂L/∂F(x): you get the negative of y − F(x) (there is a factor of two, which goes away if you put a half in front), and this is nothing but your residual, remember. We also saw AdaBoost: AdaBoost corresponds to the loss $e^{-y F(x)}$, and you can actually derive the update equations for AdaBoost starting from this loss function. So that is one way of doing it.
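In code, the negative gradients (the "pseudo-residuals" that h_m is fit to) for the two losses just discussed are simply (a sketch):

import numpy as np

# Negative gradient of the loss with respect to the prediction F: this is
# the target that h_m is fit to at each boosting iteration.
def neg_gradient_l2(y, F):       # L = (1/2) * (y - F)^2
    return y - F                 # the residual itself

def neg_gradient_l1(y, F):       # L = |y - F|, absolute deviation
    return np.sign(y - F)        # only the sign of the residual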

(Refer Slide Time: 15:14)

So we see that this works for a variety of losses. There is also something called the Huber loss, which we have not covered; you can look it up, and you can use the Huber loss as a loss function too. And remember, I told you we are doing all this using binary regression trees; but the same procedure also works when the weak learner is a binary decision (classification) tree, so do not think this works only for regression. It works for pretty much any problem, including classification, and I will mention that briefly towards the end.

So let us look at the algorithm in summary. We initialize to F_0, which is the mean response: you take the mean of the training responses, and for every x_i that is the output; that is the initialization. For subsequent iterations, you calculate the gradient of the loss function with respect to the predicted values from the previous iteration, fit a machine learning model (a binary regression tree) to that gradient, estimate µ by solving a line-search problem, update your model, and keep doing that for a fixed number of iterations as indicated by your cross-validation study. That is your gradient boosting algorithm.
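Putting the pieces together, the summarized algorithm reads roughly as follows; this sketch reuses the neg_gradient and line_search helpers from above, and alpha is the shrinkage:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, neg_grad, loss, M=100, alpha=0.1):
    F = np.full(len(y), np.mean(y))              # F_0: the mean response
    ensemble = []
    for m in range(M):
        g = neg_grad(y, F)                       # -dL/dF at the previous F
        h = DecisionTreeRegressor(max_depth=3).fit(X, g)
        mu = line_search(loss, y, F, h.predict(X))
        F = F + alpha * mu * h.predict(X)        # shrunken additive update
        ensemble.append((alpha * mu, h))
    return ensemble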

Algorithms like XGBoost, and now also something called LightGBM (GBM is nothing but "gradient boosting machine", just terminology), are basically implementations of this technique. They are of course very efficient implementations, because they fit binary regression trees very effectively, and there are a lot of parameters you can set to fine-tune them, giving you more flexibility depending on your problem. But the fundamental implementation is this: you calculate the gradient of your loss function, treating the estimated outputs as parameters.

So, I know I said parameters: you treat the estimated or predicted outputs as the parameters in your loss function, and you do gradient descent to estimate those parameters, that is, your predicted outputs. That is the fundamental idea behind all gradient boosting algorithms.

(Refer Slide Time: 17:42)

So, in summary, this can also be used for classification: we have seen it now for regression, but when you are doing classification you use a logistic function; there is a way to do that. The paper I mentioned at the beginning of the lecture actually describes how to do it, using the logistic function for a classification problem, or the output of a softmax function for multi-class classification.

Remember that when you do it this way, rather than working with the class itself you work with a probability, a real number, and you try to move that probability closer to one for the correct class. As I mentioned earlier, XGBoost and LightGBM are implementations of boosting algorithms with many hyperparameters that help you tune them for a particular problem and for optimum performance. Thank you.

Machine Learning for Engineering and Science Applications
Doctor Ganapathy Krishnamurthi
Department of Engineering Design
Indian Institute of Technology, Madras
Unsupervised Learning (Kmeans)

(Refer Slide Time: 00:15)

Hello and welcome back. In this video and the next couple of videos we will look at some topics in unsupervised learning, specifically the most popular K-means algorithm as well as the hierarchical agglomerative clustering algorithm. Some of the slides are provided by Intel software, based on their curriculum offering, and much of the content is also inspired by the texts Introduction to Statistical Learning and The Elements of Statistical Learning.

(Refer Slide Time: 00:48)

A brief overview of unsupervised learning is in order. So far we have looked at a variety of techniques which can be classified as supervised learning algorithms, in the sense that you are provided with a bunch of inputs, primarily features as we call them, and a corresponding label for each of those inputs: either a category label or sometimes just a real number, which makes it more like a regression problem. In all the cases we have seen, for a given X, which is the raw input or some features extracted from it, there is a corresponding output Y, either a label or a real-valued number in the case of regression. In unsupervised learning, however, we are just given plain unlabeled data; this is the case most of the time in the real world, because labeled data is hard to come by.

So we are given a bunch of unlabeled data, and the idea is to develop a model from the data itself, based on some metric that we define. With that metric, the model automatically separates the data into different classes, or different bins if you like to call them that; the unsupervised learning algorithm determines the structure in the data, leading to the model. Whenever there is new unlabeled data, we run it through the model to see into which of those bins or clusters it falls, or we extract the underlying structure based on the model. This is the general flow of an unsupervised learning algorithm.

(Refer Slide Time: 02:26)

For instance, topic modeling: we are given a bunch of text articles, maybe recent newspaper articles, consisting of unknown topics. Based on the text in the articles, we come up with a model which separates the articles into several topics. We do not know what the topics are beforehand; they are themselves extracted from the given text. However, when a new article with an unknown topic comes in, we run it through the model and it bins it with similar articles; that is, it predicts which class it would typically belong to.

This kind of setting is what is commonly called unsupervised learning, wherein we really do not know what the classes are or what the underlying structure of the data is; we are given only the raw data, and we try to develop a model by finding some underlying structure in it. The structure again depends on some metric that we define, as we will see for the K-means algorithm as well as for the other clustering algorithms.

(Refer Slide Time: 03:38)

So, a brief overview of K-means clustering. Say there is a website or web application, and we gather statistics of the users that use it; one such statistic is the age of the user. Let us say we have several users, in this case about thirteen or fourteen, and we plot their ages along an axis, right there. It is very obvious from the plot itself that there are two visually distinct groups in the users' ages: the green points form one group and the red ones form another.

(Refer Slide Time: 04:26)

Let us consider a slightly more complicated case, in 2D. Typically, in an unsupervised setting, in this case a K-means setting, you will have a lot more features; here income and age are the features. The input dimensionality would typically be much higher, but for the sake of illustration and understanding the algorithm we will consider just these two features. For now we will say that there are two clusters.

When you plot the data, it becomes obvious that there are two clusters. But in many unsupervised learning problems the number of clusters is also unknown; especially wherever you use K-means, the number of clusters is typically unknown, so that is another kind of hyperparameter that you have to determine.

(Refer Slide Time: 05:21)

So in this case, say we want to find two clusters. What do we do? We initialize two cluster centers, so to speak; these are the two cluster centers we have initialized, and the choice is random. We could make some clever guesses for the initial centers based on the data itself, but for now we just pick two cluster centers at random. Of course, "at random" depends on the range of your data, so it still has to be a rational choice of cluster centers.

(Refer Slide Time: 05:55)

So we randomly assign the centers, and then we move each center to its cluster's mean. What does that mean? We have assigned two randomly chosen cluster centers; following that, we calculate the distance, in this case say the Euclidean distance, between each cluster center and each one of the data points, and assign each data point to the cluster center which is closest.

For instance, this particular data point is closer to this cluster center than to that one, so we assign this data point to this cluster center. Similarly, we go across the data set and assign each point to the cluster center closest to it.

(Refer Slide Time: 07:09)

Once we do that, we get two different sets of clusters, green and blue in this case, following which we recalculate each cluster center based on the membership. So we have all these green points which have now been assigned; remember, our initial guess was somewhere here for the green, and after assigning the closest points to that cluster center, we recalculate the mean of the points in the cluster to get a new cluster center. These are the new cluster centers for the green and blue.

(Refer Slide Time: 07:49)

Once again we redo the calculation we did earlier: we reassign by calculating the distance of each point to the new cluster centers, and update the labels accordingly. You see that the labels have changed dramatically from here to there; then we once again recalculate the cluster centers based on the new labeling. We continue this until the cluster centers do not change significantly, and then we stop; now we have the membership of the points in the clusters.

(Refer Slide Time: 08:24)

We can also set K equal to three, do the same calculation as we outlined before, and get three clusters, in this case red, blue and green. The problem with K-means is that it is very sensitive to the initialization: with a completely different initialization of the cluster centers, we can end up with a different clustering of the same data set. So we had these red, blue and green clusters, and with a different initialization we get something like this, which is different from the previous clustering that we obtained.

So this is very sensitive to the initial conditions. We will see later how to resolve this problem, that is, how to decide which one of the clusterings is correct and which one we should discard. Here is one more example with K equal to three: once again, with a different starting point for the initial cluster centers, we get a different set of clusters.
(Refer Slide Time: 09:25)

To summarize, putting it formally: the K-means algorithm is obtained by minimizing a cost function, which is shown right here. W is what is called the within-cluster (intra-class) variance; we will see what that is soon. What is this trying to optimize? The idea is to optimize the assignment of each of the data points to a cluster, so that this cost function is minimized; this is a combinatorial optimization problem.

Remember, this cost function is minimized by assigning the appropriate label to each one of the data points. Say we have data points x_1, x_2, ..., x_n and K classes; say K is 2, so there are labels one and two. The optimization problem is to figure out the right assignment: we say x_1 is in class one, x_2 in class one, x_3 in class two, and so on, up to x_n in, say, class one again. This arrangement of labels is the outcome of the optimization: what is the correct arrangement of these labels, so that the within-cluster variance is minimized? That is the problem we are trying to solve.

So we will write this out in a little more detail. The within-cluster variance is defined by this formula:
$$W(C_k) = \frac{1}{|C_k|} \sum_{i,\, i' \in C_k} \sum_{j=1}^{P} (x_{ij} - x_{i'j})^2$$

Here P is the number of features in your input data: if there are P features, then each data point has features x_{i1} to x_{iP}. Given a particular cluster C_k, what this summation does is compute the squared differences (x_{ij} − x_{i'j})², where i and i' index the data points inside that particular cluster.

So, given a clustering, we look only at the elements in each cluster and calculate the sum of these squared distances between the elements in the cluster; we do so for all the clusters, and each time we divide by the number of elements in that cluster. This is the within-cluster variance that we are trying to minimize. We can rewrite this formula as
$$W(C_k) = 2 \sum_{i \in C_k} \sum_{j=1}^{P} (x_{ij} - \bar{x}_{kj})^2$$

(the constant factor of 2 does not affect which assignment minimizes the cost).

Here x_{ij} is the jth feature of the ith element in cluster C_k, and x̄_{kj} is the mean of the jth feature in cluster k; I hope that is clear. This rewritten form is the cost function we are trying to optimize (there is one more summation, over the clusters, if you bring the outer sum in). It is a combinatorial optimization problem, so it is non-trivial to do.
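A quick numerical check of the two forms (a sketch; as noted, they agree including the factor of 2):

import numpy as np

# Within-cluster variance of one cluster, computed both ways: average
# pairwise squared distance, and twice the squared distance to the mean.
def W_pairwise(Xk):                       # Xk: cluster members, shape (n_k, p)
    d = Xk[:, None, :] - Xk[None, :, :]   # all pairwise differences
    return np.sum(d ** 2) / len(Xk)       # (1/|C_k|) * sum over pairs (i, i')

def W_mean(Xk):
    return 2 * np.sum((Xk - Xk.mean(axis=0)) ** 2)

Xk = np.random.rand(10, 3)
print(W_pairwise(Xk), W_mean(Xk))         # the two values coincide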

So we use a greedy approach, something called iterative descent; that is what is commonly referred to as the K-means algorithm. Just to clarify the notation again: the data samples are indicated by x_1 to x_n (there are n data samples), small k is the cluster index, and we want to split the data into K clusters, where each cluster is denoted by C_k. C_k is just the set of indices of the data points that belong to that cluster: since the data points range from x_1 to x_n, cluster C_1 would be a set of indices of data points; say data points 1, 5 and 20 are in cluster C_1.

So that is C_1. There are other properties: the C_k do not intersect, so one data point is strictly assigned to only one cluster, and the union of all the clusters gives you the entire data set. The idea, again, is to assign the data to clusters such that the within-cluster variance, the squared distances between the data points in each cluster, is minimized. That is the problem we are trying to solve. Here is the summarized algorithm.

(Refer Slide Time: 15:37)

Rephrasing it differently: we randomly assign a cluster index to all the data points, which is the same as saying we choose cluster centroids first and, based on the distance to those centroids, assign a cluster index to each data point. In the first step, the loss function mentioned earlier (reproduced here for convenience) is minimized with respect to the mean values of the clusters: we determine the mean value of each cluster based on the data points currently in it. In the second step, given the cluster means, you can minimize the cost function further by reassigning each point to the closest cluster mean.

This is done repeatedly, and you can see that in each iteration the value of the cost function decreases; you iterate until there is no appreciable change in the cluster center locations. This is the most commonly used algorithm, referred to as the K-means algorithm; there are other ways of solving this optimization problem as well.
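A minimal NumPy sketch of this two-step loop (empty clusters and multiple restarts are not handled here):

import numpy as np

def kmeans(X, K, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=K, replace=False)]   # random init
    for _ in range(iters):
        # step 2: assign each point to its nearest centre
        d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = np.argmin(d, axis=1)
        # step 1: move each centre to the mean of its assigned points
        new_centres = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        if np.allclose(new_centres, centres):                # converged
            break
        centres = new_centres
    return labels, centres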

(Refer Slide Time: 16:56)

So, the earlier question we had: how do you determine the best possible value of K? That is not a question which is easily answered. Typically, what one would do is plot the W that we mentioned, the within-cluster variance that we calculate as part of the optimization, for the final optimized clustering, as a function of K. Typically you can then notice a knee: beyond a certain point the W value still decreases, but not by as much. You will have a knee, an inflection point, in your plot, and we choose the K corresponding to that inflection point.

So that is how K is chosen. There is the other problem we also talked about, namely that with different initializations we typically end up with different clusterings. How do you figure out which is the best possible clustering that we have obtained? To do that, we again use the same strategy: we look at the cost function, the loss function that we calculated, and choose the clustering result with the least W value.

So that is the solution to that problem. However, the choice of K will sometimes be given, and sometimes one has to discern it from the data itself; it is very data-dependent, and generally there is not too much work being done in this area, because it is something that lies somewhat outside the algorithm itself.
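The knee heuristic can be sketched with scikit-learn, where X stands for your data matrix; the n_init restarts also address the initialization sensitivity discussed above:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Plot the final within-cluster variance W (sklearn calls it inertia_)
# against K and look for the knee / inflection point.
Ks = range(1, 11)
Ws = [KMeans(n_clusters=k, n_init=10).fit(X).inertia_ for k in Ks]
plt.plot(Ks, Ws, marker="o")
plt.xlabel("K"); plt.ylabel("within-cluster variance W")
plt.show()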

(Refer Slide Time: 18:42)

So let us look at the cost function, or loss function, that is optimized by the K-means algorithm; we will set up the problem first. Initially we are given data samples x_1 to x_n, so one to n are the indices of the data samples. We decide that there are K clusters in the data; sometimes K is given and sometimes you have to discern it from the data itself. The cluster index goes from 1 to capital K, and, as we saw earlier, we want to split the data into K clusters, each given by C_k.

What does C_k contain? C_k has the indices of the data points. Say x_1, x_3 and x_100 (if n is, say, a thousand) all belong to cluster C_1; then C_1 will have the indices 1, 3 and 100. That is how we define the C_k. We want to split the data into K sets which do not intersect (the intersection of any two C_k is a null set), and the union of all the C_k is the entire data set that you are using. And how do we pose this as an optimization problem?

We want to assign data samples to clusters such that the within-cluster variance is minimized. We can think of the within-cluster variance as the sum, or the average, of the squared Euclidean distances between the members of a cluster, between the data points in that cluster. Posed formally, this is the optimization problem we are trying to solve, where W is the within-cluster variance: for each cluster we calculate this W, and we take the sum over all K clusters. What are we trying to do in the process? We are trying to figure out the assignment of the data points to the clusters, which is what we saw earlier, such that this sum is minimized. That is our optimization problem. This W itself can be written in the form shown, where the inner summation over j runs through the features: if the data set is one-dimensional, you are just given n data points, each a scalar value; in general each data point can be a vector of values, the length of the vector being p.

So the inner sum runs over the p features: we calculate the Euclidean distance between two data points i and i' belonging to a particular cluster k; for each pair of data points i and i' we calculate the Euclidean distance between them by summing over their feature-wise squared differences, and of course we take the mean value for every cluster. This is the cost function for a single cluster, and we have to sum it over all the clusters. It is possible to rewrite this cost function in the other form: there, x_{ij} is the jth feature of the ith data point in cluster C_k, and x̄_{kj} is the mean of feature j in cluster C_k.

So, as we saw earlier in the illustrations, we are just trying to find the distance of every data point within a cluster to the center, the cluster mean; we sum over all the distances inside a cluster, do the same for every cluster, and add them all up. What the optimization problem solves is figuring out the correct assignment of the data points to the clusters such that the summation we calculated is minimized.

(Refer Slide Time: 24:06)

If you were to pose this as an algorithm and try to solve it: this is a combinatorial optimization problem, and the way to solve it is an iterative technique called iterative descent. One way of looking at it: we randomly assign a cluster index to each of the data points, for the given number of clusters K. The first step is to minimize the loss function with respect to the mean values of the clusters, which is done by computing the mean of the points currently assigned to each cluster. Once the cluster centers are determined, the second step is to recalculate the distance of all the data points to each one of the cluster centers, and then reassign each data point to the appropriate cluster, the one to which the distance is smallest. We go back and forth between these two steps until there is convergence, in the sense that the cluster centers stop changing.

Machine Learning for Engineering and Science Applications
Doctor Ganapathy Krishnamurthi
Department of Engineering Design
Indian Institute of Technology, Madras
Agglomerative Clustering

Hello and welcome back. In this video we will continue with unsupervised clustering techniques and look at agglomerative clustering. All the illustrations and figures in this presentation are courtesy of Intel software, based on their educational material, and the content is also inspired by The Elements of Statistical Learning textbook by Tibshirani and his colleagues.

(Refer Slide Time: 00:38)

In the previous video we looked at the K-means algorithm, and we saw that in order to perform unsupervised clustering we need to know the number of clusters beforehand; that was an input to the algorithm, though there were heuristics to determine the number of clusters based on the inertia, or cost function. The agglomerative clustering that we are going to look at is a hierarchical clustering technique that does not really require the number of clusters a priori. Instead, it results in what is called a dendrogram, a binary-tree-like structure from which the user is then free to choose the appropriate clustering. We will see what that is in the next few slides.

(Refer Slide Time: 01:35)

An overview of the algorithm: let us consider the same data set as for the K-means algorithm, the income and age statistics of the users of a particular website, plotted in 2D so it is easy to visualize. We have all these data points, and hierarchical agglomerative clustering starts off by treating each data point as a cluster by itself, finding the closest pair, and merging them into a cluster. What do we mean by closest pair? It is based on a dissimilarity metric, basically a Euclidean distance; we will see what the various dissimilarity metrics are in later slides. This is also sometimes referred to as the linkage, and there are different types of linkage we can use to find the least dissimilar pair and group them. So, based on a dissimilarity metric, we find the closest pair, the least dissimilar, and merge them into a cluster.

(Refer Slide Time: 02:48)

In the next iteration we find the next closest pair and merge them: in the first iteration we had these two, and in the second iteration we found these. We continue this way, looking at two data points, or two clusters, at a time; we evaluate the dissimilarity metric and merge the pair with the least dissimilarity. Continuing like this, in this case we have created about four clusters so far, and the closest pair can now be two clusters themselves.

So we have these two here, and in the next iteration we merge two clusters if they happen to be the closest pair. We will see how to determine the distance between two clusters when they are not individual data points but contain multiple data points; that is where the linkage comes in, deciding which pair is closest in that case. For now, let us assume we have figured out a way to decide how dissimilar two clusters are, based on some Euclidean metric, and we just merge the pair with the least dissimilarity. We can continue this way, merging closest pairs of data points as well as of clusters.

(Refer Slide Time: 04:13)

As we keep merging, the number of clusters begins to reduce. Let us start here: we have six clusters in this scenario, each color denoting a cluster. If we merge two of them we get five; from here to there we have four clusters, then three, two and one. Finally the algorithm stops when the entire data set is assigned to one cluster. That is the idea: we start out with individual data points, pair them up based on a dissimilarity metric, and the least dissimilar ones get merged into a cluster.

As we keep progressing through the data set, at some point we have to start merging clusters too, based on the dissimilarity metric, and at the last step the algorithm results in one cluster. This is a bottom-up approach; there is a top-down approach as well, wherein we divide the data set based on the dissimilarity metric and keep dividing until each individual data point is a cluster. These are two ways of doing it; we are looking at the bottom-up approach now.

(Refer Slide Time: 05:37)

So let us consider the point where we have about five clusters; this is the situation, and we will see how we can actually decide on the number of clusters. It is slightly subjective, but visually it is very apparent.

(Refer Slide Time: 05:50)

In this case, since it is a two-dimensional problem, we can visualize it. We have each of these clusters plotted here; call this axis the cluster index if you like (this axis does not really have a label). Each cluster is plotted at a height, and the height is the cluster distance, the dissimilarity we have been talking about: when we merge two clusters, it gives rise to a new cluster, and the height is the dissimilarity measure between the two clusters that were merged. So now we have five clusters, each resulting from the merging of several clusters, and the heights along this axis are the dissimilarity metrics between the clusters that were merged to obtain them. Think of each of them as a single cluster.

Now, we started off with five; say we merge the closest pair, so we get four clusters. Going back here, we just merged this with that, so we have four clusters now. Visualized this way, we have merged these two clusters to get a new one here, and this height is the dissimilarity measure computed between those two clusters. Similarly, we can go ahead and merge two other clusters to get three clusters; that is how these two were merged, and again the height indicates the dissimilarity measure between the two merged clusters.

Here it is slightly intuitive, in the sense that we realize that as we keep merging clusters, the dissimilarity metric keeps increasing. It is automatically an increasing function, because initially we merge points which are very close to each other, and as we start to merge clusters we end up merging ones which are quite dissimilar; that is why the height keeps increasing. That gives a very clear idea of how we can figure out how many clusters there are in a data set.

(Refer Slide Time: 08:16)

So then we can go to the point where you have about two clusters and

1068
of course here this gives raise to this one, right, so finally what we can do
is merge these two to get something here at top, so then how do we do the
split so thats why we look at this gray line here, so based on the height along
this axis so this is the cluster distance we can draw line that is basically
threshold at which we want to create the clusters, so when we cut cross this
way we will end up hiring one two three four unique clusters in our data
set, right, because we see that merging these two cluster leads to very large
dissimilarity matrix, same way merging these two cluster this one and here
give rise to a very large dissimilarity matrix.

So we can just make the cut here, that give rise to four unique clusters.
So, the idea behind using this cluster distance is a threshold is that if you
look at the, so each of this point here this are the nodes of the tree that
we have constructed, they represent clusters and all the data point in that
particular cluster are basically more similar to each other than to data points
in another cluster represented by a same node at the same level basically,
so we obtain this data points by combining several clusters data points and
it is plotted of the height of the dissimilarity matrix that we calculated for
merging those data points, okay.

So even though all of them are pretty much at the same height, the point is that all the elements in a cluster are more similar to each other than to elements in another cluster at the same height along this axis. That is the point behind this dendrogram; this is what is called a dendrogram, which is basically like a binary tree, and we decide where to make the cut so that we end up with unique clusters.
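To make this concrete, here is a minimal sketch (my own, not from the lecture) of how the dendrogram and the cut could be produced with SciPy; the toy data, the Ward linkage choice, and the threshold value of 5.0 are all illustrative assumptions.

```python
# Hypothetical sketch: build a dendrogram and cut it at a chosen height.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(0)
# Three well-separated blobs of 2-D points (toy data).
X = np.vstack([rng.normal(c, 0.3, size=(20, 2))
               for c in ([0, 0], [4, 0], [2, 3])])

Z = linkage(X, method="ward")   # (N-1) x 4 merge table; column 2 holds merge heights
dendrogram(Z)                   # the binary tree described above
plt.axhline(5.0, color="grey", ls="--")   # the grey threshold line
plt.ylabel("cluster distance (dissimilarity)")
plt.show()

# Cutting below the largest jump in merge height yields the natural clusters.
labels = fcluster(Z, t=5.0, criterion="distance")
print(np.unique(labels))        # e.g. [1 2 3]
```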

(Refer Slide Time: 10:24)

So we have talked about the dissimilarity measure, and now we have to see how these dissimilarity measures are calculated. Basically it is just the Euclidean distance between the points, but there are different types, so we will look at them one by one. What do we mean by Euclidean distance? Let us say you have p features and x_i denotes a data point; what we would like to calculate, between any two data points i and i′, is the (squared) Euclidean distance

d(x_i, x_{i′}) = Σ_{j=1}^{p} (x_{ij} − x_{i′j})²,

so that is the metric that we typically calculate.

So we can calculate this metric between any pair of data points that we want to merge, but when we are trying to merge clusters, what do we do? That is when the linkage becomes important. So here we have different types of linkages. The first one we look at is single linkage, which is the minimum pairwise distance between clusters. What is shown here are the pairwise distances between these two clusters: the way to do this is to calculate the distance between all the points in one cluster and all the points in the other cluster.

So we get pairwise distances between all the points across the two clusters, and as the dissimilarity measure we choose the minimum pairwise distance among them. In this case we can see that these two points are the closest, so that minimum pairwise distance will be the distance between the clusters, and the same thing holds for these two clusters here.

So we can do the same here: there are four clusters, so we should be able to calculate the minimum pairwise distance between each pair of them, and that is what these black arrows represent. Between each pair of the coloured clusters we calculate the minimum pairwise distance and use that as the dissimilarity metric.

(Refer Slide Time: 12:56)

The complete linkage calculates the maximum pairwise distance between clusters; as the name implies, we basically calculate the maximum distance between the points across the two clusters. In this case, as an illustration, between these clusters here, the green and the light blue, this black arrow represents the maximum pairwise distance between the data points in the two clusters. Similarly, between the yellow and the red, this particular arrow represents the maximum pairwise distance between them.

So as we saw earlier for single linkage, we do the same for all the clusters, taking each pair at a time, calculating the maximum pairwise distance between them, and using that as the dissimilarity metric. Average linkage is basically the average of the pairwise distances between the elements of the two clusters: we take all the elements of cluster one and all the elements of cluster two (in this case the blue and the green), calculate the distance between all pairs of them taking one from each cluster, and take the average.

So it is the average: you take the individual pairwise distances, average all of them, and that is the dissimilarity metric. In this case, as you would expect, this arrow here represents the average distance between the green and the blue clusters, and similarly this arrow represents the average distance between the red and the yellow clusters.

(Refer Slide Time: 14:42)

So this plot shows all the average distances between all pairs of clusters. Finally we come to Ward linkage, or centroid-based linkage, where we merge based on inertia; this inertia is basically what we use for the K-means algorithm. It is the distance between the centroids of the clusters, which is what this shows. Based on that we can link: these two would be merged if you are looking at centroid-based distances, because these two are the closest centroids. So the dissimilarity metric is basically the distance between the centroids of the clusters.
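As a rough illustration (my own sketch, not from the slides), the four rules can be written directly in terms of pairwise Euclidean distances between two clusters; the arrays A and B are made-up toy clusters. Note that Ward linkage proper uses the increase in within-cluster variance on merging; the centroid distance below is the simpler centroid-based variant described here.

```python
# Sketch of the linkage rules for two clusters A and B (assumed toy arrays).
import numpy as np

def pairwise_dists(A, B):
    # All Euclidean distances between points of A and points of B.
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)

def single_linkage(A, B):    # minimum pairwise distance
    return pairwise_dists(A, B).min()

def complete_linkage(A, B):  # maximum pairwise distance
    return pairwise_dists(A, B).max()

def average_linkage(A, B):   # mean of all pairwise distances
    return pairwise_dists(A, B).mean()

def centroid_linkage(A, B):  # distance between the cluster centroids
    return np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))

A = np.array([[0.0, 0.0], [0.0, 1.0]])
B = np.array([[3.0, 0.0], [4.0, 1.0]])
print(single_linkage(A, B), complete_linkage(A, B),
      average_linkage(A, B), centroid_linkage(A, B))
```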

(Refer Slide Time: 15:25)

So, to summarize agglomerative clustering: you start at the bottom, that is, with the individual data points, treating each individual data point as a cluster, and recursively merge them one pair at a time, producing a grouping at the next highest level, so we will have N−1 levels. How do you merge a pair? You merge the pair with the smallest dissimilarity measure between them, and we saw the different dissimilarity measures. The recursive grouping leads to a dendrogram, or binary tree, where each node represents a cluster and the root node represents the entire data set. As I mentioned earlier, the data points under each node are more similar to each other than to the data points under other nodes at the same level, and the level along the dendrogram depends on the dissimilarity measure between the two clusters that were combined to give that node; that is how the dendrogram arises.

(Refer Slide Time: 16:25)

So, to summarize, the algorithm itself proceeds like this: initially treat each data point as a cluster and merge the two closest clusters (in the initial iterations these will mostly be individual data points), and keep looping till you get a single cluster. In this case these two will merge, as we saw earlier (I am just reproducing the graph here), and it is up to the user to determine the ideal number of clusters: you can make the cut at whatever level you choose, for instance at this level or at this level.

So if you look at this, it makes more sense to make the cut here, because of the large change in the cluster distance, that is, the dissimilarity measure, and then you hope the data gives rise to one, two, three, four unique clusters from the data set. With this we conclude our short foray into unsupervised clustering techniques. To summarize, the two methods that we looked at, K-means and agglomerative clustering, are both useful when we have large amounts of unlabeled data and we are just trying to figure out an underlying structure in that data set; given that we do not have any prior information regarding the data set, these techniques are very useful, and they can also be used on very large data sets.

So typically they have been used, for instance, agglomerative clustering has been used on DNA microarrays and K-means has been used for image processing, so on and so forth. Even with very large data sets it is possible to use these techniques, given the absence of any other prior information or label information, and they give you an idea of what the grouping is like in the data. Thank you.

Machine Learning for Engineering and Science Applications
Professor Dr. Ganapathy Krishnamurthy
Department of Engineering Design
Indian Institute of Technology, Madras
Probability Distribution Gaussian, Bernoulli

(Refer Slide Time: 0:15)

Hello and welcome back. In this video we will look at probability distributions, notably the Bernoulli and Gaussian distributions. All the slides were provided by Dr Christopher Bishop, based on his textbook PRML ('Pattern Recognition and Machine Learning').

(Refer Slide Time: 0:29)

So we will first consider the Bernoulli distribution, which deals with binary variables. So basically there are 2 states, either 0 or 1, okay. So if you have a feature which can take either of 2 values, that would be a good example of a binary random variable, okay. So just to have an example, we consider a coin toss experiment with a damaged or biased coin, which means that the probability of getting heads and the probability of getting tails are not the same; maybe there is a higher probability of getting a head than getting a tail when you toss it, okay.

So we will consider the random variable x, which is linked to the event heads or tails, okay. Depending on the event, x will take a certain value: if you toss the coin and you get heads, x gets the value 1, and if you toss the coin and you get tails, x gets the value 0, okay. So that variable x is referred to as a binary random variable, and it is associated with getting either heads or tails on a coin toss.

(Refer Slide Time: 1:40)

Now what we do is assign a probability to that event. Since this is not an unbiased coin, it has a bias, we will say that the probability of getting a head in 1 coin toss is given by the value μ, okay. So μ denotes the probability that if you toss the coin once you get a head, and as a matter of notation, this μ is the parameter of the Bernoulli distribution, right. So if the probability of getting a head is μ, okay, what would the probability of getting a tail be? Since heads and tails are mutually exclusive, it will be 1−μ, right. So given a coin toss event, if you want to give the probability of getting either a heads or a tails, this can be captured by a single combined expression.

So that is basically the probability of x given μ, where x is the event heads or tails, given by the expression P(x|μ) = μ^x (1−μ)^(1−x), right. So how does this work? If x is heads, then x = 1, so the probability of x equal to 1 given μ is P(x=1|μ) = μ^1 (1−μ)^0 = μ, which is correct, right. And the probability of x equal to 0 given μ is P(x=0|μ) = μ^0 (1−μ)^1 = 1−μ, okay, so that is also correct. So the probability of getting a heads or a tails is what this P(x|μ) = μ^x (1−μ)^(1−x) gives.

If you want to link it to some experiment, that is what this denotes: μ is the parameter of the Bernoulli distribution, so in problems where we are dealing with this probability distribution, most of the time the problem will be about estimating μ; that is what we will be doing most of the time. So given μ, we can now formulate the expressions for getting a heads or a tails, P(x=1|μ) and P(x=0|μ).

The average value of x is E[x] = μ, okay (we will see that with a small example some time later), and the variance of x is var[x] = μ(1−μ), okay. These 2 you can actually calculate from your formulas for the expectation and the variance, right. We will look at more concrete examples later on, but this is just an introduction to the probability mass function of the Bernoulli distribution, right, which is given by Bern(x|μ) = μ^x (1−μ)^(1−x), where x is the event; x can be either 1 or 0 depending on whether it is heads or tails, okay. So this is not just for coin flips: whenever you have any variable that you can associate with either/or choices, where there are only 2 choices that variable can take, you can associate a Bernoulli random variable with it quite easily, okay.
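As a small check of these formulas, here is a sketch (my own, with an assumed bias μ = 0.7) that evaluates the Bernoulli PMF and verifies the mean and variance by simulation.

```python
# Sketch: Bernoulli PMF, with a mean and variance check (assumed bias mu = 0.7).
import numpy as np

mu = 0.7

def bern_pmf(x, mu):
    # Bern(x | mu) = mu^x * (1 - mu)^(1 - x), for x in {0, 1}
    return mu**x * (1 - mu)**(1 - x)

print(bern_pmf(1, mu), bern_pmf(0, mu))   # 0.7 and 0.3

rng = np.random.default_rng(0)
x = rng.binomial(1, mu, size=100_000)     # simulated coin tosses
print(x.mean())                           # close to E[x] = mu = 0.7
print(x.var())                            # close to var[x] = mu*(1-mu) = 0.21
```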

(Refer Slide Time: 5:08)

So now, parameter estimation for the Bernoulli distribution: it is done with what is called the maximum likelihood technique; we will see what that is again later on, but just from a common-sense point of view we should be able to follow this argument. So we have a set of these events, let us say N coin tosses, okay, out of which we get m heads and N−m tails, okay.

So what is the probability of observing this sequence of m heads and N−m tails? That is what this denotes: the probability of observing the data set given the probability of heads (this is a conditional probability). Since each of these events is independent, we just take the product of the individual probabilities; that is what this expression does:

p(D|μ) = ∏_{n=1}^{N} p(x_n|μ).

Since you are doing N coin tosses, for each coin toss the probability of observing the outcome is p(x_n|μ), and since there are capital N coin tosses you just multiply all of them, which is what gives the product. And in place of p(x_n|μ) we substitute the expression for the probability mass function of the Bernoulli distribution.

So recall that p(x|μ) = μ^x (1−μ)^(1−x). Here we have a sequence of N coin tosses, so for each coin toss we can write this probability, and since there are N coin tosses, each of them independent, we just take the product

p(D|μ) = ∏_{n=1}^{N} μ^{x_n} (1−μ)^{1−x_n}.

This symbol ∏ represents the product of the N terms, okay.

From a computational point of view, since multiplication can lead to numerical errors, we take the log of this expression: the log of a product decomposes into a sum of logs, okay, because for logarithms log(AB) = log(A) + log(B), right. So that is what we have done for each of those terms, and that is the expression right here: taking the log of this product decomposes it into this sum, right. Then if we substitute the values in, we can expand it in this fashion; let us write it down in a concrete way.

So log p(D|μ) = log ∏_{n=1}^{N} μ^{x_n} (1−μ)^{1−x_n}. The log of this entire expression (a product of capital N terms) can be written as a sum of N terms by taking the logarithm inside, so we get

log p(D|μ) = Σ_{n=1}^{N} [ x_n log(μ) + (1−x_n) log(1−μ) ],

since log(μ^{x_n}) = x_n log(μ), right (these are products, right, so they become a summation here), plus the corresponding (1−x_n) log(1−μ) term. The summation is over this entire term, okay.

So this is a simple matter of writing the logarithm of a product as a sum of logs, okay; that is what I have done. Then if you want to estimate μ, you can set

d/dμ [log p(D|μ)] = 0,

and solving gives

μ_ML = (1/N) Σ_{n=1}^{N} x_n = m/N,

the total number of heads divided by the total number of tosses, where m is the total number of heads, okay. So you might wonder why we would consider this expression and why we would take its derivative; you can think of this log p(D|μ) like a loss function, okay. So how is it a loss function?

So we formulated this probability of capital D given μ; capital D is your data, and the data is basically the sequence of coin tosses. What is the probability of observing these coin tosses given this Bernoulli parameter? That is what this expression evaluates, right. Because each coin toss is independent of the others, we write the probability of observing the data as the probability of observing x_1 times the probability of observing x_2 times the probability of observing x_3, and so on and so forth, up to the probability of observing x_N, so the product of all these terms. But then we know the expression for the probability, since x is a Bernoulli random variable, so we can plug that expression in, and the rest is just algebra. So now we have the probability of observing this data set, that is, this sequence of coin tosses, given this μ. Now what we want to do is maximise the probability of observing this data set, right; that is the idea behind constructing this probability. And maximising the probability of observing the data set is the same as maximising the log of the probability of observing the data set with respect to μ, the parameter that you want to estimate.

So by taking the derivative with respect to μ of this function log p(D|μ), you can calculate μ_ML. This function log p(D|μ) is referred to as the log likelihood, and p(D|μ) is referred to as the likelihood, the likelihood of observing this data set given the parameter μ. So this is a sequence of Bernoulli trials, and it is characterised (we assume that it is characterised) by this μ, which is the probability of observing x = 1, okay. We then construct the probability of observing the entire data set, which is x_1, x_2, ..., x_N, assuming that each trial is independent of the others.

So it is just the product of the probabilities of observing the individual trials; that is what is given in this expression. You just plug in the formula for p(x|μ) for a Bernoulli distribution, take the log (because it is easier to process that way), then differentiate this log with respect to μ and set it to 0, and from there you can derive μ_ML (I have not gone through the algebra, but it is not too hard to do, okay). So this probability of observing a dataset given the parameters of your distribution is referred to as the likelihood, and taking the logarithm of that is usually referred to as the log likelihood, okay.
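The closed form μ_ML = m/N is easy to verify numerically; the sketch below (toy data, with an assumed true bias of 0.6) compares the counting estimate with a brute-force maximisation of the log likelihood over a grid.

```python
# Sketch: maximum likelihood estimate of the Bernoulli parameter (toy data).
import numpy as np

rng = np.random.default_rng(1)
x = rng.binomial(1, 0.6, size=1000)   # N tosses of a coin with assumed bias 0.6

# Closed form: mu_ML = m / N (number of heads over number of tosses).
mu_closed = x.mean()

# Brute force: maximise the log likelihood over a grid of mu values.
mus = np.linspace(0.01, 0.99, 999)
loglik = x.sum() * np.log(mus) + (len(x) - x.sum()) * np.log(1 - mus)
mu_grid = mus[np.argmax(loglik)]

print(mu_closed, mu_grid)             # both close to 0.6
```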

(Refer Slide Time: 12:05)

Now we consider another distribution which we will be using often from now on, known as the Gaussian distribution. Again, the form of the distribution is given here; if you are not familiar with it, you will become familiar with it by just going through this expression:

N(x|μ, σ²) = (1/√(2πσ²)) exp(−(x−μ)²/(2σ²)).

Here σ is the standard deviation and σ² is the variance (again note that this is for a one-dimensional variable, so x is a one-dimensional variable and we are looking at one x), and μ is the mean of the distribution, okay. And what does this expression evaluate?

So if you have a value x of that event (this is for continuous variables), this expression evaluates the probability of observing x, okay. What is plotted on the graph here is the probability distribution, wherein the x-axis is x and the y-axis is the probability of observing x, okay, and the mean is μ. The width of the distribution is 2σ, where σ is the standard deviation and σ² is the variance.

Now, you may not be sure what kind of variables lead to this kind of distribution, okay. So just to give you a practical example, let us consider a class with a hundred students who take an exam. You get the marks of every student, so we have a table with the student serial number and the marks for that particular test: the serial number goes from 1 through, let us say, 100, and the marks, let us say, go from 40 to 100. Everybody passed; let us say we have a hard threshold of 40 and everybody got over 40. You can imagine that the marks go 41, 41.5, and so on and so forth, but nobody got less than 40, and somebody could have got 100, okay. So this is our data: about 100 students and their marks in an examination.

So what you do is construct a histogram. How do you construct a histogram? Take an axis for the marks, going, just for the sake of convenience, from 40 to 100, okay. The trick to constructing a histogram is to make bins on your axis, so we will make bins of size 5: 40 to 45 is one bin, 45 to 50 is another bin, and so on and so forth up to 100, okay. Then you count the number of students whose marks fall between 40 and 45, okay, and you get some number; the other axis is the number of students, okay.

So you count the numbers for all the bins, and typically what you get will be something along these lines; I am drawing it smooth, so you will get something like this. You make bins of size 5: what is the number of students who got marks between 40 and 45? What is the number of students who got marks between 45 and 50? And so on and so forth, and you just plot it as a histogram. On the x-axis you can plot the centre of each bin, so instead of indicating 40 to 45 as I have here, you can just say 42.5 and plot that number, 47.5 and plot the number of students between 45 and 50, and so on, and you get this histogram, okay. And if you normalise the y-axis, the number of students, by the total number, what you get is the probability distribution of the marks that the students scored in that particular examination, okay. It is a very good way of summarising what your class performance is like: if somebody gets, let us say, 40 marks, you can plug that in and see whether he is one of the few students, how far away he is from the mean, that sort of thing.
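A quick sketch of that construction (the marks below are synthetic numbers, assumed to be roughly normal around 70, just to illustrate the binning and normalisation):

```python
# Sketch: histogram of exam marks, normalised to a probability distribution.
# The marks are synthetic (assumed mean 70, std 10, clipped to [40, 100]).
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
marks = np.clip(rng.normal(70, 10, size=100), 40, 100)

bins = np.arange(40, 105, 5)                     # bins of size 5: 40-45, 45-50, ...
counts, edges = np.histogram(marks, bins=bins)
centres = (edges[:-1] + edges[1:]) / 2           # plot at bin centres: 42.5, 47.5, ...

plt.bar(centres, counts / counts.sum(), width=4) # normalise by the total count
plt.xlabel("marks"); plt.ylabel("fraction of students")
plt.show()
```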

So to summarise the statistics of a class, this is a good way of doing it, because then you can easily assign grades using this, okay. That is one of the things that people usually do: they calculate this curve and use it to assign grades, right. So you know the mean, this is the μ; we will see how the mean and the standard deviation are estimated for a Gaussian distribution, something that is very familiar to you. But the assumption that you are making every time you do this calculation is that your data follows a Gaussian distribution, right.

So I am giving you the example of marks; you can do the same for, let us say, the heights of all the students in your class, okay. You have a class of a hundred or several hundred, you measure all their heights, and if you plot a histogram of the heights based on some bins, then again the distribution will look very close to a normal distribution. For that matter, many of the quantities that you measure might look Gaussian distributed, because many of the calculations that you usually do make this assumption; we will see what those calculations are. But this is the 1-d, one-dimensional example, wherein you have marks, okay.

Sometimes the variable x you are looking at is multi-dimensional, let us say of dimension D (you are used to the dimension n, but for the purposes of this lecture I am going to use D, okay). So the dimensionality of x is D: let us say x has many features, that is one way of looking at it, right. If x has many features, then x will be some vector of length D, and this μ will also be a vector of length D, because each dimension has its own mean. This Σ is called the covariance matrix and it has size D × D, okay, and |Σ| is the determinant of the covariance matrix, whose square root appears in the normalising constant, okay. So for a multi-dimensional x, the Gaussian distribution formula is

N(x|μ, Σ) = (1/((2π)^(D/2) |Σ|^(1/2))) exp(−(1/2) (x−μ)^T Σ^(−1) (x−μ)).

So what we have plotted here in this figure is basically the case where D is 2, okay, with 2 variables x_1 and x_2, and these red lines are contours of constant probability density.

So for instance, if you plug in x_1 and x_2 and find the range of all values of x_1 and x_2 for which the probability density is the same, then you can plot the locus of all those (x_1, x_2); that is basically this curve, okay. What it shows is that x_1 and x_2 are slightly correlated, right, because x_2 seems to increase linearly with x_1, okay. So these plots are useful in that way for figuring out whether there is a correlation between your variables when you are dealing with multi-dimensional variables, okay. So this is the general form of the Gaussian distribution: the quadratic term comes in the exponent, and for the one-dimensional case you would typically write it as exp(−(x−μ)²/(2σ²)).
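For reference, evaluating this density numerically is straightforward with SciPy; the mean and covariance below are made-up values chosen to show a positive correlation between x_1 and x_2.

```python
# Sketch: evaluate a 2-D Gaussian density (assumed mean and covariance).
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.8],
                  [0.8, 1.0]])   # positive off-diagonal -> x1 and x2 correlated

p = multivariate_normal(mean=mu, cov=Sigma)
print(p.pdf([0.5, 0.5]))         # density at one point
# All points along a contour of constant density evaluate to the same value.
```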

(Refer Slide Time: 20:38)

So then, on to Gaussian parameter estimation. Remember, for the Bernoulli distribution we looked at how we can estimate the Bernoulli parameter μ, which is the probability that x = 1, right; so how do we do the similar thing here? The normal distribution (it is also called the normal distribution or the Gaussian distribution) is denoted by 2 parameters, μ and σ², so we want to estimate μ and σ²; how do we do this? Okay, so recall that this is for a continuous random variable, so x can take any real value, but usually data is discrete, right; we only observe specific values of x, that is the thing.

So once again, what you do is, based on the observations (these blue points are the observations we have), calculate the probability of observing the data, or the likelihood. X refers to the data that we have observed, so let us say we have N points. Once again, like we saw for the Bernoulli distribution, the probability of observing the data set is the product of the probabilities of observing each one of the individual data points, and we know that the probability of observing each data point is given by the normal distribution, so the total probability of observing this data set is

p(X|μ, σ²) = ∏_{n=1}^{N} N(x_n|μ, σ²), okay.

(Refer Slide Time: 22:06)

So then, similarly, we take the log of that likelihood, and the log of the product transforms into a summation of logs, which is what is given here:

ln p(X|μ, σ²) = −(1/(2σ²)) Σ_{n=1}^{N} (x_n − μ)² − (N/2) ln(σ²) − (N/2) ln(2π).

I urge you to actually work this out yourself, because it is virtually trivial algebra, but you will also become comfortable with the expressions here. Now, once we have this expression, what we want to do, like we saw earlier, is to maximise the log likelihood of observing the data, right. What would that mean? That μ and σ² should be such that the probability of observing this data is very high, and the way to do that is to take the derivative of this expression.

So we set d/dμ [ln p(X|μ, σ²)] = 0, and then d/dσ [ln p(X|μ, σ²)] = 0, and once you solve these you get 2 results, neither of which should surprise you; this is called the maximum likelihood technique. μ_ML is nothing but the average of your data points,

μ_ML = (1/N) Σ_{n=1}^{N} x_n, okay,

and σ²_ML, the variance (this parameter in the Gaussian terminology is called the variance), is nothing but the variance of your data given μ_ML,

σ²_ML = (1/N) Σ_{n=1}^{N} (x_n − μ_ML)²,

that is what you get, okay.

So these are typically the statistics that you always calculate, right: whenever you have any data set, you characterise it using the mean and standard deviation. When you do that, what you are assuming is that your data comes from a normal distribution. These are the assumptions you are implicitly making when you calculate the mean and standard deviation: that your data is normally distributed, that it is drawn from a normal distribution with a particular σ and a particular μ, okay.
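These two estimates are exactly the sample mean and the (biased, 1/N) sample variance; here is a small sketch with assumed true parameters μ = 2 and σ = 0.5.

```python
# Sketch: maximum likelihood estimates for a 1-D Gaussian (toy data).
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(2.0, 0.5, size=10_000)   # assumed true mu = 2.0, sigma = 0.5

mu_ml = x.mean()                        # (1/N) * sum of x_n
var_ml = ((x - mu_ml) ** 2).mean()      # (1/N) * sum of (x_n - mu_ML)^2

print(mu_ml, np.sqrt(var_ml))           # close to 2.0 and 0.5
```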

So in the context of machine learning, we will model data using these distributions: for instance, for problems which involve a 0 or 1 choice we will use the Bernoulli distribution, and for problems where continuous variables are involved we will use the Gaussian distribution. And (you will see this, and I will explain it in a later video) what we end up doing is modelling this μ using our data (even linear regression can be brought into this form; we will do that in a video soon). So we end up modelling, or trying to estimate, this μ and σ, or in the case of the Bernoulli distribution the Bernoulli parameter μ; those are what we try to estimate every time, okay.

That is the output we are looking for, and implicitly what I wanted to show with this video is that typically, for a given data set, you calculate μ and σ, okay. This is done implicitly in many of the models that you will be using; we will see that in a later video.

(Refer Slide Time: 25:30)

So if you want to consider multi-dimensional data, where each x_n has, let us say, D dimensions, okay, the procedure is the same: we still construct the log likelihood of the data, given a dataset consisting of N points which are independent and identically distributed (that is what i.i.d. stands for: independent and identically distributed). So the probability of observing x_1 times the probability of observing x_2, and so on for all the N data points; the product of each of these probabilities is the probability of observing the entire dataset, okay. And we proceed like we did for the 1-dimensional case: we take the log of the probability, and you get an expression like this:

ln p(X|μ, Σ) = −(ND/2) ln(2π) − (N/2) ln|Σ| − (1/2) Σ_{n=1}^{N} (x_n − μ)^T Σ^(−1) (x_n − μ).

Once again, you can take the derivative with respect to μ (remember, now μ is a multi-dimensional variable) and you also take the derivative with respect to each of the elements of Σ (again, Σ is a matrix here, so it has D² elements), okay. So once we do that, we follow the same process: take the derivatives with respect to μ and Σ, set them to 0, and obtain the values of μ and Σ, okay.

In this context, remember this: in order to describe a Gaussian probability distribution, in order to estimate or calculate it, we just need 2 quantities, the mean and the covariance, okay. These are referred to as the sufficient statistics (we call them that, right). So we are not going to need all the data points; in fact, that is one of the reasons why you construct this probability distribution: if you have a large data set, you can actually summarise the data set with just 2 parameters.

In the one-dimensional case it is just the mean and standard deviation; in the multi-dimensional case you will have more than that, because remember this is a D × D matrix. You will have D means, right, because each feature has a mean, and Σ turns out to be a symmetric matrix, so you will have D(D+1)/2 independent parameters, right, because for a symmetric matrix those are the unique elements of Σ, okay. So it is still a lot: remember, if there are a hundred features, that is a lot of parameters to estimate. So typically, when you are solving these problems using probabilistic techniques, what people usually do is assume that Σ is diagonal, okay, so that you have only D parameters to estimate. In fact, they may even assume that Σ = σ²I, so then you again have only one parameter, okay. So depending on how you model, you can reduce the number of parameters you would like to estimate for a multidimensional Gaussian, okay.

(Refer Slide Time: 29:00)

So we set the derivative of the log likelihood function to zero, again with respect to μ, and you can actually solve to obtain that μ, again for the multidimensional case, is the mean of your data points,

μ_ML = (1/N) Σ_{n=1}^{N} x_n;

again, remember that these are vectors, so for every dimension you have to independently calculate the mean. And once again, the covariance matrix is given by the expression

Σ_ML = (1/N) Σ_{n=1}^{N} (x_n − μ_ML)(x_n − μ_ML)^T,

the average of the outer products of your centred data. Again, remember x and μ are vectors of dimension D, okay.
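In code these two formulas are a mean and an averaged outer product; the sketch below (toy 2-D data, assumed true parameters) also checks the result against np.cov with the 1/N normalisation.

```python
# Sketch: ML estimates of the mean vector and covariance matrix (toy 2-D data).
import numpy as np

rng = np.random.default_rng(4)
X = rng.multivariate_normal([1.0, -1.0], [[1.0, 0.5], [0.5, 2.0]], size=5000)

mu_ml = X.mean(axis=0)                     # one mean per dimension
centred = X - mu_ml
Sigma_ml = (centred.T @ centred) / len(X)  # (1/N) sum (x_n - mu)(x_n - mu)^T

print(mu_ml)
print(Sigma_ml)
print(np.allclose(Sigma_ml, np.cov(X.T, bias=True)))  # same as np.cov with 1/N
```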

So we have looked at 2 distributions, Bernoulli and Gaussian. The Bernoulli distribution is used to describe a variable that can take 0 or 1 values; a typical example is a coin toss. It has only one parameter, which is basically the probability of observing x = 1, okay. x = 1 might correspond to any event; for instance, in the coin toss we talked about, x = 1 corresponds to heads and x = 0 corresponds to tails. So the distribution is characterised by the probability of x = 1, which is given by μ, and we also saw how we estimate μ: given a sequence of N coin tosses, we just count the total number of coin tosses where it landed heads, which corresponds to the total number where x = 1, and the ratio of the total number where x = 1 to the total number of actual coin tosses N gives you the estimate of μ for the Bernoulli distribution.

For the Gaussian distribution there are 2 parameters: one is the mean, and the other is the standard deviation (or the covariance in higher dimensions). The mean is basically the mean of your observed data points; remember that you have to take the mean across every feature independently. And the covariance is nothing but the covariance of x; it is zero-centred, so the mean of (x − μ_ML)(x − μ_ML)^T is your covariance matrix. Again, remember x and μ are D-dimensional, where D is the dimensionality of x, okay.

Subsequent to this we will look at various remaining techniques in machine learning: we will look at SVM, we will look at the naive Bayes classifier, and then we will move on to maximum likelihood estimation and how we would apply it to, let us say, linear regression, then maximum a posteriori methods, and finally Bayesian regression. We might postpone Bayesian regression to the next week, but maximum likelihood estimation and MAP techniques we will look at this week. Thank you.

Machine Learning for Engineering and Science Applications
Professor Dr. Ganapathy Krishnamurthi
Department of Engineering Design
Indian Institute of Technology, Madras
Covariance Matrix of Gaussian Distribution

(Refer Slide Time: 00:13)

A small note about the covariance matrix of a Gaussian distribution.

(Refer Slide Time: 00:17)

So again, the slides are from Dr. Christopher Bishop's textbook PRML. You are all familiar by now with the univariate Gaussian, where the probability density function is given by some normalizing constant times an exponential, p(x) = A exp(−(x−μ)²/(2σ²)). For the multivariate Gaussian we had this, with the term in the exponent (I am just writing down the term in the exponent) being (x−μ)^T Σ^(−1) (x−μ), and here again x and μ are vectors, depending on the number of dimensions that you are operating in, ok.

Now take this covariance matrix, ok; it is defined as the expectation of this quantity, cov[x] = E[(x − E[x])(x − E[x])^T] = Σ. That is your covariance matrix, and if x is a D-dimensional variable, then the covariance matrix has D × D elements, ok. So what will the different forms of this D × D matrix show you?

So the first plot here corresponds to the most general form of the D × D matrix, wherein all the elements are different, which means that there is some correlation between the different features that you have. What is drawn here: let us say we are looking at a two-dimensional variable, with x_1 and x_2 as your features; then these lines correspond to lines of constant PDF, ok, and you can see that they are like tilted ellipses.

So this corresponds to a full D × D matrix, ok. Here Σ is a diagonal matrix and every diagonal element has a different value; then the lines of constant PDF correspond to axis-aligned ellipses like this, and the major and minor axes tell you what the actual diagonal elements are; these lengths correspond to the sizes of the diagonal elements, ok.

The third case is when it is again diagonal, but you can write it as σ²I, where all the diagonal elements are equal; then the lines of constant probability density are given by circles, ok, so if you plot the locus of constant density you get circles. So in the most general case, this tilted way of drawing the ellipses, where each of these lines corresponds to a constant PDF value, shows that there is some correlation between the variables, ok.

So if you plug in all the pairs (x_1, x_2) along one of these lines, they evaluate to the same PDF value, ok; that is what the contours we have drawn show. If you think about it, p(x_1, x_2) rises above the (x_1, x_2) plane, and if we cut it at a constant height and project onto the plane, these are the lines we get. Again, there are three cases: the most general case with a full D × D matrix, the case where Σ is a diagonal matrix with each diagonal element different, and the case where it is diagonal with all the diagonal elements the same.
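The three cases are easy to generate; here is a sketch (with assumed numerical values) constructing a full, a diagonal, and an isotropic covariance matrix for D = 2.

```python
# Sketch: the three covariance-matrix forms for D = 2 (assumed values).
import numpy as np

full = np.array([[1.0, 0.6],
                 [0.6, 0.5]])   # general symmetric: tilted elliptical contours

diag = np.diag([1.0, 0.25])     # diagonal, unequal: axis-aligned ellipses

iso = 0.5 * np.eye(2)           # sigma^2 * I: circular contours

for name, S in [("full", full), ("diagonal", diag), ("isotropic", iso)]:
    # Eigenvalues are proportional to the squared axis lengths of the contours.
    print(name, np.linalg.eigvalsh(S))
```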

Machine Learning for Engineering and Science Applications
Professor Dr. Ganapathy Krishnamurthi
Department of Engineering Design
Indian Institute of Technology, Madras
Central Limit Theorem

(Refer Slide Time: 00:13)

Hello and welcome back. In this video we will look at the central limit theorem; there is only one slide, courtesy of Dr. Christopher Bishop, based on his PRML textbook.

(Refer Slide Time: 00:24)

So what does the central limit theorem state (now that we are familiar with the Gaussian distribution and the Bernoulli distribution)? What it says is that the distribution of the sum of N independent and identically distributed random variables becomes increasingly Gaussian as N becomes very large, ok. So if you look at these plots, the first plot corresponds to N = 1; remember that we are considering the sum of N independent, identically distributed random variables.

So in this case N = 1: there is no summation here, we have just drawn numbers from a uniform random distribution between 0 and 1, ok. Now if we consider the sum of two numbers, each of which is drawn from a uniform random distribution, and we plot the histogram of those draws, then we see that it is already approaching a Gaussian, and as we go to N = 10 it looks more and more like a Gaussian, ok.

So again, to reiterate, we are considering the sum of N numbers, ok; I say numbers, but it is really the summation of N random variables which are drawn from the same distribution, so that is the condition: independently drawn from the same distribution, independent and identically distributed, ok. So why is this important? Because in many of our machine learning problems where we use probabilistic models, we typically end up using the Gaussian distribution as the model, ok.

So then how do we justify that? Typically the justification comes from here. If you consider your data points, for instance let us say you have a bunch of data points and we decide that they are Gaussian distributed, then what is the justification? One justification you can give, depending on the problem, is that each of these data points can be considered as a sum of N numbers drawn from a similar distribution, and if N is a very large number, then we can say that each of them is drawn from a Gaussian distribution; that is the idea behind using the central limit theorem.
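The plots on the slide are easy to reproduce; this sketch sums N uniform draws for N = 1, 2, 10 and histograms the results (the bin count and sample size are arbitrary choices of mine).

```python
# Sketch: central limit theorem demo with sums of uniform random variables.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(5)
fig, axes = plt.subplots(1, 3, figsize=(9, 3))
for ax, N in zip(axes, [1, 2, 10]):
    # Sum of N i.i.d. Uniform(0, 1) draws, repeated 100000 times.
    s = rng.uniform(0, 1, size=(100_000, N)).sum(axis=1)
    ax.hist(s, bins=50, density=True)
    ax.set_title(f"N = {N}")   # approaches a Gaussian as N grows
plt.show()
```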

So one example: you are all familiar with cell phone cameras, where you take pictures, right, ok. But what does that cell phone camera do? It collects the light photons that are reflected off the object that you are photographing, ok, and it integrates them; in a sense it is literally counting the number of photons, ok. That is what the camera detector in your cell phone does.

So if you look at counting statistics, counting statistics usually follow what is called a Poisson distribution, ok; we have not covered the Poisson distribution, but by definition counts are Poisson distributed. However, when you consider a large number of light photons, your detector is actually, you know, integrating the counts, ok; each time a photon falls, the intensity increases, you can think of it that way.

1093
So when you get the output picture, each of the pixels corresponds to a detector element in the camera, and even though the individual statistics are Poisson, because we are integrating over a large number of those photons, the intensity can be modelled as a Gaussian distribution, ok. This is the importance of the central limit theorem: in the limit of N being very large, when we consider a sum of i.i.d. random variables drawn from the same distribution, the result can be interpreted as coming from a Gaussian distribution, ok. Thank you.

Machine Learning for Engineering and Science Applications
Professor Dr. Ganapathy Krishnamurthi
Department of Engineering Design
Indian Institute of Technology, Madras
Naïve Bayes

(Refer Slide Time: 00:14)

Hello and welcome back. In this video we will look at the naive Bayes classifier, which is another supervised learning paradigm.

(Refer Slide Time: 00:22)

So all the slides are courtesy of Intel Software.

(Refer Slide Time: 00:27)

So we will start off with some probability basics; some of these would have been covered in previous videos, but this is just for the sake of continuity. We use Venn diagrams to denote the space of events. The probability of event X is given by P(X), which is the circle highlighted in yellow, and we have another event Y, which is again given by the Venn diagram circle right here, also highlighted in yellow.

So we have two events X and Y. The joint probability of their occurrence is denoted by P(X, Y) (of course, the single-event probabilities are P(X) and P(Y) respectively); here in the Venn diagram it is the region of intersection. The conditional probability P(X|Y) is basically this region here: if you go back, we see that it is the part of X that is also in Y. So that denotes the conditional probability P(X|Y), the probability that event X occurs given that Y has occurred. Similarly, we can define the conditional probability P(Y|X), ok.

So the joint and conditional probabilities are related to each other: the joint probability P(X, Y) = P(Y|X) P(X), or equivalently P(X|Y) P(Y). You should have seen this relationship before.

(Refer Slide Time: 02:07)

So given this rule, we can then equate P(Y|X) × P(X) = P(X|Y) × P(Y); that is how the conditional and joint probabilities are related, ok. So to invert the probability, we make use of this relationship right here: if we bring the P(X) to the denominator on the right-hand side, we get the conditional probability P(Y|X) in terms of the conditional probability P(X|Y) and the individual probabilities P(X) and P(Y), ok.

Similarly, the denominator P(X) can be derived by marginalizing the joint probability over some event Z, ok:

P(X) = Σ_Z P(X, Z) = Σ_Z P(X|Z) × P(Z).

Here we get to choose Z such that its values form a set of mutually exclusive possibilities. For instance, if you are looking at a diagnostic test, we can say Z covers the events that the test is positive and that the test is negative, ok (two options in Z: test is positive and test is negative), and you can think of X as the event that you have the disease, ok. So that way you can make an intelligent choice of Z, and you can marginalize over Z to get the probability P(X) in the denominator, ok.

(Refer Slide Time: 03:50)

So this relationship that you have seen so far is Bayes theorem, which you should have been introduced to, where the conditional probability P(Y|X) is written in terms of P(X|Y) and the individual probabilities P(Y) and P(X), ok. This can be interpreted in the context of supervised learning (specifically in supervised learning): P(Y|X) is referred to as the posterior probability, P(X|Y) is the likelihood, P(Y) is the prior, and the denominator P(X) is known as the evidence, ok.

So typically the evidence is the one that is very difficult to calculate, because we see that it involves marginalizing over another variable, and that typically turns out to be an intractable calculation in most cases, ok. There are ways of avoiding this, which is what we will see later on; at least for naive Bayes we will see why we can avoid it, ok. In the context of whatever we have seen so far, I think I might have mentioned, when we talked about probability distributions, the idea of likelihood, ok: where we calculated the likelihood of obtaining the data, that is exactly this term right there, ok. And later on, when we get to MAP (Maximum A Posteriori) and Bayesian regression, we will once again revisit these concepts, ok.

(Refer Slide Time: 05:14)

So like we mentioned earlier, calculating this denominator is the difficult task, and most of the time we try not to calculate it; in fact, we will just treat it as some constant in most problems. So whatever you calculate as the posterior will not exactly be a probability, since it is not normalized. This is a difficult thing to calculate, and we try to get around this calculation in most problems that we encounter, ok.

(Refer Slide Time: 05:44)

So what we do now, in the context of supervised learning classification, is write down the Bayes rule and see how it can be used for classifying; that is how the naive Bayes classifier draws its name. So we replace Y with the class label C, and of course this is again Bayes rule for that class; like I mentioned earlier, we are not calculating the evidence in this case, because it would involve a sum which is difficult to calculate, ok.

So P(X) is something we are not taking into account here; we will see later why. What we are trying to do here is calculate P(C|X), the probability of the class given X. The class comes from your classification problem, which can be a multi-class or binary-class problem, and the feature X is basically your input data point from the training data. So the probability P(C|X) is basically your classification, right: we are trying to classify based on the input feature X.

So if you have three classes, we can write 3 such expressions, for C_1, C_2, C_3, and in each of them the denominator P(X) is going to be the same. So when you are comparing the outputs P(C|X), they all differ from the true posteriors by the same multiplicative constant, so we can compare them without explicitly calculating P(X), right.

(Refer Slide Time: 07:21)

So given this, based on Bayes rule we have the expression P(C|X) ∝ P(X|C) × P(C); we are ignoring P(X) because it is the same for all classes, alright. So the idea is that we evaluate this for C_1 and C_2 (let us say a 2-class problem), and we just compare these two numbers. Of course, since we have not normalized by P(X), it will not exactly be a probability value; we can call it a score, and we compare the scores we get, and whichever class has the higher score, we assign the input data point to that class, ok.

So how do we actually go about doing that, right? Let us just look at the calculation. We want to calculate P(C|X), ok, where X in this case is not one-dimensional but n-dimensional, which means that there are n features, ok. So we can write this explicitly as P(X_1, X_2, ..., X_n|C) × P(C), and then we apply the chain rule of probability to the expression P(X_1, X_2, ..., X_n|C), which we can write in the form P(X_1|X_2, ..., X_n, C) × P(X_2, ..., X_n|C).

So if you look at it, we have taken X_1 as an individual variable and grouped (X_2, ..., X_n) together, so once again we can use the rule for writing a joint probability in terms of conditional probabilities:

P(X_1, ..., X_n|C) = P(X_1|X_2, ..., X_n, C) × P(X_2, ..., X_n|C), ok.

So this is the way we learned to write the joint probability in terms of the conditional probability. If you recall (I will use A and B instead of X and Y), we wrote P(A, B) = P(A|B) × P(B); that is exactly what we did here, except that there is a conditioning on C, the class, throughout, and of course we have taken that into account when we wrote it out, ok.

So if we can do this successively, right: we have done this for X_1, and again we can take P(X_2, ..., X_n|C) and write it as P(X_2|X_3, ..., X_n, C) × P(X_3, ..., X_n|C), and so on. We can keep writing this way and decomposing, but the problem is that doing this calculation is hard, because there are so many terms: if there are 100 features, you literally have to expand everything, and then you have to calculate the probability of one feature conditioned on all the others, alright.

(Refer Slide Time: 11:10)

So this is hard, so what we do is make the so-called naive Bayes approximation, which says that all the features are independent of each other. So if you go back and look at this term P(X_1|X_2, ..., X_n, C): since we assume that all features are independent of each other, this is the same as P(X_1|X_2, ..., X_n, C) = P(X_1|C), because it does not depend on the values X_2, X_3, ..., X_n take.

So this is the naive Bayes approximation, or the naive Bayes assumption, if you want to call it that. It is an assumption because, as you will have seen with real data, features are typically not independent; there is always some correlation or some kind of relationship between the features. But we just ignore that, and then we independently calculate these quantities to get to the probability of a class given a feature vector X, ok.

So if you have a multi-class problem, we calculate this probability for each one of the classes and assign a given new data point X to the class with the largest probability, ok. So this can be written in product form like this:

P(C|X) ∝ P(C) × ∏_{i=1}^{n} P(X_i|C).

P(C) is the probability of the class C; it is called the prior probability of the class. So this comes about by making the assumption that all the features are independent of each other; that way the dependency of one feature on the others goes away, and that is why we are able to write it in this simplified form, ok.

(Refer Slide Time: 13:03)

So as I mentioned earlier, the way to do the class assignment is what we call the MAP rule, or the maximum a posteriori rule. If there are K classes, we calculate P(C_1|X), P(C_2|X), ..., P(C_K|X), so K numbers will be calculated, and the data point will be assigned to the class with the highest probability; that is the argmax there, ok. This is the gist of the naive Bayes classifier: select the potential class with the largest value, ok.

(Refer Slide Time: 13:47)

It is no longer a probability, because we have not normalized by P(X); it is just some number, called a score, and the data point is assigned to the class with the largest score. Now, multiplying so many numbers (in this case let us say we have a thousand features, or a hundred thousand features) can cause overflow or underflow problems, etcetera; especially since you are multiplying probabilities, it can cause underflow problems.

So then you just calculate the log of that; this is another score that you can calculate. If you take the log of the whole expression, the product gets converted into a sum:

score(C_k) = log P(C_k) + Σ_{i=1}^{n} log P(X_i|C_k);

we can do that.

(Refer Slide Time: 14:28)

So we have seen this data set in the binary decision tree algorithm; now let us look at how to predict playing tennis with naive Bayes, right. Remember, for a bunch of days we have the following features: Outlook, Temperature, Humidity and Wind, and based on these features we decide whether to play tennis or not, so it is a 0 or 1 problem, ok.

So then let us see how we can use naive Bayes to deal with this, ok.

(Refer Slide Time: 15:01)

So here is an example of how you train the naive Bayes classifier, ok. The probability of Play equal to Yes is 9 over 14 (we have 14 data points) and the probability of Play equal to No is 5 over 14, ok. So this is your P(C) (remember, from the previous slide, this is how we calculate the probability of the class, right), and we calculate the probability of a particular feature given the class, right. So how do we do that?

So Outlook takes the values sunny, overcast and rainy, ok. If you go back and look at the counts, out of the 9 times when we say Play is Yes, twice it is sunny, 4 times it is overcast and 3 times it is rainy. So we can just calculate that probability as a relative frequency, right. Similarly, out of the 5 times when we decide not to play, 3 times it is sunny, 0 times it is overcast and 2 times it is rainy.

Similarly, we can do that for Temperature: of the 9 times we decide to play, twice it is hot, 4 times it is mild and 3 times it is cool, so those are the probabilities, ok. So basically we are calculating these numbers P(X_i|C_k). What does that mean? We are trying to calculate the probability of Temperature equal to hot when Play equal to Yes; that is this column.

Similarly, this column is the probability of Temperature equal to hot when Play equal to No. It is not exactly the whole column; it is this particular row that we are calculating. Similarly, the probability of Temperature equal to mild when Play equal to Yes is what we calculate here, and the probability of Temperature equal to cool when Play equal to Yes is what we calculate here.

So here Play equal to Yes corresponds to C = 1, Play equal to No corresponds to C = 0, and these are the X_i that we are talking about, ok. So we can calculate these tables based on the training data, right; we can calculate these conditional probability tables from the training data. Similarly, for each class, C = 1 and C = 0, we can look at Humidity and compute the probability of Humidity equal to high given class C = 1, and so on and so forth, ok.

So one good exercise is to go back and see if you can do this calculation yourself, ok; that is the way to learn this calculation, right. Similarly for Wind: the probability of Wind equal to strong given Play equal to Yes, which again corresponds to C = 1, and this corresponds to C = 0, ok; these are the calculations that you have to do.

So what we do in the naive Bayes model is work with the given training data and estimate these conditional probabilities just by counting frequencies of occurrence, ok. That is the only calculation we do for the naive Bayes model.
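Here is a compact sketch of exactly that counting; I have typed in the standard 14-row play-tennis table from the slide, so treat the rows (and the helper names) as assumptions of mine.

```python
# Sketch: training a categorical naive Bayes by counting (play-tennis data,
# typed in from the standard 14-row example, so treat it as assumed).
from collections import Counter, defaultdict

# (Outlook, Temperature, Humidity, Wind) -> Play
data = [
    ("sunny","hot","high","weak","no"),      ("sunny","hot","high","strong","no"),
    ("overcast","hot","high","weak","yes"),  ("rainy","mild","high","weak","yes"),
    ("rainy","cool","normal","weak","yes"),  ("rainy","cool","normal","strong","no"),
    ("overcast","cool","normal","strong","yes"), ("sunny","mild","high","weak","no"),
    ("sunny","cool","normal","weak","yes"),  ("rainy","mild","normal","weak","yes"),
    ("sunny","mild","normal","strong","yes"),("overcast","mild","high","strong","yes"),
    ("overcast","hot","normal","weak","yes"),("rainy","mild","high","strong","no"),
]

class_counts = Counter(row[-1] for row in data)   # 9 yes, 5 no -> P(C)
feat_counts = defaultdict(Counter)                # counts for P(X_i | C)
for row in data:
    c = row[-1]
    for i, v in enumerate(row[:-1]):
        feat_counts[(i, c)][v] += 1

def prob(i, value, c):
    # Empirical conditional probability P(X_i = value | C = c).
    return feat_counts[(i, c)][value] / class_counts[c]

print(prob(0, "sunny", "yes"))   # 2/9, as on the slide
```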

(Refer Slide Time: 18:55)

So then what happens when a new data point comes in? A new data point is a new set of features, so we are given Outlook, Temperature, Humidity, and Wind. For this X, Outlook happens to be sunny, Temperature happens to be cool, Humidity happens to be high, and Wind is equal to strong, ok. So then we want to calculate the probability P(C=1|X) and the probability P(C=0|X), ok.

So how do we do this? We go back to the naive Bayes formulation; if you remember correctly, it is P(C) × ∏_{i=1}^{n} P(X_i|C), that is our formula for the probability P(C|X), and this is what we do. So the probability P(C=1), for the first row, is 9/14, as we calculated (if you remember, there are 9 data points which are Yes).

So here, when you compute the probabilities, you just use the relative frequencies: out of 14 data points, 9 times you decide to play, so the probability of Yes is 9/14, ok. And then, for each feature, the probability of sunny given that you play tennis, the probability of cool given that you play tennis, the probability of high humidity and of strong wind given that you play tennis; we can look all these numbers up from the tables that we calculated here, ok, and we just plug them back in. Similarly, we can do the same thing for the probability of not playing tennis.
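Continuing the earlier sketch (class_counts, prob, and data are the hypothetical helpers defined above), scoring the new day is just the product of the looked-up probabilities:

```python
# Sketch: score both classes for the new day (sunny, cool, high, strong).
x_new = ("sunny", "cool", "high", "strong")

for c in ("yes", "no"):
    score = class_counts[c] / len(data)   # prior P(C)
    for i, v in enumerate(x_new):
        score *= prob(i, v, c)            # times each P(X_i | C)
    print(c, score)                       # "no" comes out with the higher score
```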

(Refer Slide Time: 20:29)

So for instance, this is how the table outlines the calculation. For the feature Outlook equal to sunny, these are the conditional probabilities that you evaluate: these are the P(X_i|C), and this is P(C). You take the product of all of them, and whichever class comes up with the higher score wins; here 0.026 is higher than 0.0053, so you just decide that you will not play tennis, ok.

(Refer Slide Time: 21:32)

So now we will address the problem of what happens when you have a category with a zero count in it, right. What does that mean? For instance, if we go back (I think you will remember this), what happens here is that the probability of overcast given Play equal to No is 0: there are no data points where Outlook is overcast and we decide not to play, ok; you do not have that data point. So when you construct this empirical probability, you get 0.

So what is the problem with that? Let us say instead of Outlook being sunny I put Outlook is
overcast, especially here I put overcast, then I will be multiplying with the 0 (one of these
numbers is a 0 right), instead of 1 this one will become a 0. So then we get a 0 probability so
that does notices make sense, ok.

(Refer Slide Time: 22:48)

One way to address this is: if any feature has a 0 probability or frequency, you can just ignore that feature. But that is a very strong decision to make, because it might be an important feature for which you simply do not have enough data points. So there is a technique called Laplace smoothing, where basically you add 1 to the numerator; you increase the probability by assuming some underlying uniform distribution.

For instance, let us say there are two features and one of them has a 0 frequency of occurrence. For P(X_1 | C), instead of putting a 0, you put 1 over the count you typically have in the denominator for that particular class, plus n. And since you have increased this count artificially, you do the same thing for the other features: even for feature 2, where you actually have data (the count of how many times X_2 occurs when class C=1), you increase the numerator by 1 as well. This technique is called Laplace smoothing: you artificially add to the numerator for every class, and that way you make sure that there are no zero multiplications.
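
Written as a formula, the add-one variant just described is the following (a standard form; here n_i, the number of distinct values that feature X_i can take, is our reading of the "n" in the lecture):

    P(X_i = v \mid C = c) \;=\; \frac{\mathrm{count}(X_i = v,\, C = c) + 1}{\mathrm{count}(C = c) + n_i}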

Let us go back and walk through the probability calculations for this particular data set. How do you calculate these probabilities? We saw that (if you go back and look at the Naïve Bayes formula) we are trying to calculate the numbers P(X | category). In the example we looked at, the category is 1 or 0: 1 corresponds to yes and 0 corresponds to no, ok.

So in this case we are trying to calculate the probability that a feature takes a particular value given that we decide to play, that is, given that the class is 1. The way to calculate that: we count the number of times class equal to 1 occurs in the data set, which is 9, and among all the data points with class equal to 1, the number of times outlook equal to sunny occurs, which is 2. That is how you calculate this probability from the training data.

Based on the training data, this particular feature is categorical: it takes on three values, sunny, overcast and rainy. In fact, in this entire data set all the features are categorical; there is no continuous variable, but this is how you would calculate it. So let us go back and look: how many times do we decide to play? The yes category occurs 9 times, and 2 out of those 9 have outlook equal to sunny; so 2 of the times when outlook is sunny, we decide to play.

So that is why the probability P(X | C) (in this case, outlook equal to sunny given play equal to yes, which is what this corresponds to) is 2/9. This is what we have to calculate for every one of those variables for every class. If instead we have, say, 3 or 4 classes, then there will be 3 or 4 columns here, and for each one of those columns you have to calculate the probability of that particular feature occurring with that particular value, and make that table.

So Naïve Bayes training basically involves building this probability table; depending on the type of variable and your training data, you will have to figure out how to calculate each probability.
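
To make this concrete, here is a minimal sketch in Python of Naïve Bayes training and prediction for categorical features, with the Laplace smoothing described above; the four play-tennis rows are a small illustrative subset, not the full 14-row table from the lecture.

    from collections import Counter, defaultdict

    def train_naive_bayes(X, y):
        """Estimate P(C) and the P(X_i = v | C) tables by relative frequency."""
        class_counts = Counter(y)
        priors = {c: class_counts[c] / len(y) for c in class_counts}
        cond = defaultdict(lambda: defaultdict(Counter))   # cond[i][c][v] = count
        values = defaultdict(set)                          # observed values per feature
        for row, c in zip(X, y):
            for i, v in enumerate(row):
                cond[i][c][v] += 1
                values[i].add(v)
        return priors, cond, values, class_counts

    def predict(x, priors, cond, values, class_counts):
        """Pick argmax_c P(c) * prod_i P(x_i | c), with Laplace smoothing."""
        scores = {}
        for c in priors:
            score = priors[c]
            for i, v in enumerate(x):
                score *= (cond[i][c][v] + 1) / (class_counts[c] + len(values[i]))
            scores[c] = score
        return max(scores, key=scores.get), scores

    # Toy rows: (outlook, temperature, humidity, wind) -> play?
    X = [('sunny', 'hot', 'high', 'weak'), ('sunny', 'hot', 'high', 'strong'),
         ('overcast', 'hot', 'high', 'weak'), ('rain', 'mild', 'high', 'weak')]
    y = ['no', 'no', 'yes', 'yes']
    model = train_naive_bayes(X, y)
    print(predict(('sunny', 'cool', 'high', 'strong'), *model))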

Machine Learning for Engineering and Science Applications
Professor Dr. Ganapathy Krishnamurthi
Department of Engineering Design
Indian Institute of Technology, Madras
Maximum likelihood Estimation Intro
(Refer Slide Time: 00:14)

Hello and welcome back. In this video we will look at maximum likelihood estimation. You are all familiar with the linear regression model: we have a bunch of data points Y_i and the corresponding features X_i, and we typically formulate the model as W^T X_i; again, X_i can be multi-dimensional or just a single variable, but that does not matter here. When we looked at linear regression we used the least squares loss function, which is nothing but

(1/m) ∑_{i=1}^{m} (Y_i − W^T X_i)².

That was our model, and we took the derivative of this loss (we used gradient descent) to estimate W. So why did we say we use least squares? Why is the power two? We have looked at L1 and L2 norms and all that, but still, why is this the best way to do it? There are many ways of approaching this; one way of justifying the least squares loss function is the probabilistic perspective.

(Refer Slide Time: 01:53)

So we will consider the individual errors: for instance, we define a variable ϵ_i = Y_i − W^T X_i, the error. This error might be due to noise in our measurement, or to missing data (some features X_i might be missing for a particular data point), and maybe there is error in measuring X_i as well as in measuring Y_i. One assumption people typically make is that these errors are Gaussian distributed. What does that mean? It means that the probability of observing a particular ϵ_i is given by a zero-mean Gaussian,

P(ϵ_i) = (1/√(2πσ²)) exp(−ϵ_i²/(2σ²)).

Once you make this assumption, we can rewrite our problem.

(Refer Slide Time: 03:49)

Because we know the distribution of the error, we can write this as the probability

P(Y_i | X_i, W) = (1/√(2πσ²)) exp(−(Y_i − W^T X_i)²/(2σ²)).

So this is our model: we have just plugged the model in and reinterpreted this probability as P(Y_i | X_i, W). The idea behind maximum likelihood is to maximize this likelihood, which means maximizing this expression with respect to W. Another way of looking at it: we can equivalently optimize a monotonic function of P. In particular, if you take the negative log, −log(P(Y_i | X_i, W)), then (ignoring some additive constant terms) taking the log of the exponential and then the negative sign gives you (Y_i − W^T X_i)². Again, I have written this for just one data point; if your training data consists of m data points, it becomes the summation ∑_{i=1}^{m} (Y_i − W^T X_i)². Why that happens: if you want to maximize the probability of observing the whole data set, assuming the points are IID, the probability of the data set is the product of the probabilities of the individual points. So the likelihood is a product of m such terms, each contributing a factor with (Y_i − W^T X_i)² in the exponent, and when we take the log of the product we get a summation. And this is our least squares cost function.
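
Written out as one chain (a standard derivation; additive constants are dropped):

    \mathcal{L}(W) = \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi\sigma^2}}
                     \exp\!\left( -\frac{(Y_i - W^T X_i)^2}{2\sigma^2} \right)

    -\log \mathcal{L}(W) = \frac{1}{2\sigma^2} \sum_{i=1}^{m} (Y_i - W^T X_i)^2 + \text{const}

    W_{\mathrm{MLE}} = \arg\min_W \sum_{i=1}^{m} (Y_i - W^T X_i)^2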

I have not written out the step where I take the product, but that is something you should be able to do. Another way of looking at it: if we assume that our data is Gaussian distributed, what we are modeling with W^T X is the mean. If you remember the form of the Gaussian distribution (I am just going to use a different variable, but that should not throw you off), it is proportional to exp(−(X − μ)²/(2σ²)), where μ is the mean and σ² is the variance. The W obtained by maximizing the likelihood is what we refer to as the maximum likelihood estimate.

So when we use the least squares cost function, we are assuming that the errors are Gaussian distributed; basically we are trying to model the mean using W^T X (mean in the sense that, for every measurement, this is the value on average; you can think of it that way), and we are trying to estimate this parameter assuming the Gaussian distribution. When we maximize the likelihood of observing the data given the parameters, which is what we are trying to estimate, we end up with the least squares loss function.

Of course, we can also show that for classification problems, at least for the two-class problem, if we start from the Bernoulli distribution we end up with the log loss, or binary cross-entropy, cost function; the derivation goes pretty much the same way. So that was an introductory look at the maximum likelihood estimate. If time permits, either this week or in the subsequent weeks (there are a couple of weeks left), we will look at the maximum a posteriori estimate, where basically we use Bayes rule and incorporate a prior: prior times likelihood gives you the posterior probability; that is what we usually compute. If we take that one step further and do a full Bayesian analysis, it is called Bayesian regression. Time permitting, we will address these two topics in weeks 11 and 12. Thank you.

Machine Learning for Engineering and Science Applications
Professor Dr. Ganapathy Krishnamurthi
Department of Engineering Design
Indian Institute of Technology, Madras
PCA – Part 1
Hello and welcome back. In this video we will look at principal component analysis; we will take only a brief look at it, because it is primarily one of the techniques used for data pre-processing or data normalization before the data is used as input to machine learning or deep learning algorithms in general.

(Refer Slide Time: 00:28)

Theoretically, increasing the number of features should improve performance. However, as you increase the number of features there are problems: it turns out that too many features can lead to worse performance. The reason is that we need more training examples when we have more features. Think of it as filling up a space: as this example shows, if you have only one dimension you can sample it with few data points; in two dimensions you need more data points to sample the entire space of the problem, and as the number of dimensions keeps increasing, the number of points you need in order to sample the space keeps increasing. So if you have many dimensions and relatively few training examples, you run into problems training your algorithm.

(Refer Slide Time: 01:27)

The solution to this problem is dimensionality reduction, wherein we reduce the number of features that represent the data. How do we do this? The idea is to reduce the dimension by selecting subsets, by feature elimination, and the algorithm that does this for us is what we refer to as principal component analysis.

In this case we have data with two features: number of cigarettes per day and height. This is some sample of a population; at this point do not worry about what this data was collected for or what the classification or regression task is; we just know that these two features are there.

So there are two features, and the important thing to observe here is that the features increase together: they are correlated. I would like you to go back and think about the Naïve Bayes algorithm, where the assumption is that the features are not correlated. So in this case, if the features are correlated, can we reduce the number of features to one? This is a very easy problem to visualize, which is why we start with two features; in typical machine learning problems there will be hundreds of features and you will be forced to deal with a completely different scale of problem.

So in this case we have the two features, height and cigarettes per day, and they are clearly correlated. Also remember that if there are only two or three features, we can actually draw these correlation plots and manually eliminate some features; that is a possibility. But if you have hundreds of features, you can imagine the number of correlation plots you would have to examine to see which of the features are correlated. PCA is a way of telling you which are the most informative, uncorrelated directions, and which features only appear significant because they are correlated to other features in some way or the other.

So, can we reduce the number of features to one? The way to do that is: if we can fit this line (you can figure out this line) and project the values of the training points onto it, then all we need is the locations of these projections along the line, and that becomes your new axis. So the way to think of PCA, at least in a physical problem, is as a rotation of your axes. If you go back: these are your axes, cigarettes per day and height, and the idea is, now that we know there is some correlation, can we rotate these two axes so that the values along one axis are very small and the values along the other axis are very large? Then we can afford to ignore the small one.

Machine Learning for Engineering and Science Applications
Professor Dr. Ganapathy Krishnamurthi
Department of Engineering Design
Indian Institute of Technology, Madras
PCA – Part 2
(Refer Slide Time: 00:14)

So that is what we have done: we have managed to project our data, which has two columns containing height and cigarettes per day, onto a single axis. You can think of it as some combination of height and cigarettes per day. As I mentioned earlier, you can also think of it as a rotation of your axes. I am sure at some point in college or school you have seen this: when you have axes X and Y and you rotate them by an angle, say θ, to get new axes X′ and Y′, it is possible to express X′ and Y′ each as a combination of X and Y. You must have seen this in your high school algebra. PCA accomplishes something similar; in general it is not exactly this simple rotation but a more involved transformation, still combining both features, height and cigarettes per day.
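
For reference, the standard 2-D axis-rotation formulas alluded to here are:

    X' = X\cos\theta + Y\sin\theta, \qquad Y' = -X\sin\theta + Y\cos\theta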

(Refer Slide Time: 01:10)

So finally, that is what we have: we have created a single feature which is a combination of height and cigarettes per day, and this process of reducing the dimensionality of the data is what we call principal component analysis.

(Refer Slide Time: 01:23)

Mathematically, what we can state is: given an N-dimensional data set X, the idea is to find an N×K matrix U so that the operation Y = U^T X gives new data Y of reduced dimension, in the sense that it has dimension K, which is less than N. That is what is given here; that is precisely what we want to do, stated in terms of linear algebra and matrix operations.

(Refer Slide Time: 01:57)

Now let us consider this data set; it has two features, X1 and X2. Visually we can see that a lot of the information lies along one axis, the axis shown here by the black arrow. This means that this axis has the maximum variance (if you construct this axis, the data has maximum variance along it), and we have another axis which is orthogonal (at 90 degrees) to it, along which the variance is very small. So if you project your data onto the first axis (the axis we drew first), you will be able to capture most of the information, because the variance is highest in that direction, while the variation along the orthogonal direction is much lower. This is the idea behind principal component analysis.

So what do we need for that? Two things: the direction of this axis (that is, the vector), and the length of the vector, because the length helps us determine whether the variance in that direction is high or not (the larger the length, the larger the variance along that direction); similarly we need the direction and length of the other vector. That is what principal component analysis helps to determine.

(Refer Slide Time: 03:25)

So how do we accomplish this? We will not go through the actual algorithm for determining these directions, but we can show what the process does. Principal component analysis is accomplished using what is known as singular value decomposition (note the slide has a typo: it is usually called singular value decomposition, not single value decomposition). It is a matrix factorization method that is normally used for principal component analysis, and it does not require a square data matrix. It is what the Python package scikit-learn uses for PCA, and even MATLAB has a command, svd, which computes the singular value decomposition of a non-square matrix.

So in this case we have an M×N matrix with M equal to 5 and N equal to 3: five rows and three columns, where N is the number of features. What singular value decomposition does is factorize this matrix into a product of three matrices: this one is called the matrix of left singular vectors, this the matrix of right singular vectors, and this the singular value matrix. If you do this on a square symmetric matrix, the singular value matrix is nothing but the diagonal matrix of eigenvalues, and the left and right singular matrices are the corresponding matrices of eigenvectors.

So if we have an M×N, in this case 5×3, matrix, we have five data points and three features, and it is these three features we want to reduce. We get the U matrix of left singular vectors, which is of size M×M (five by five); the singular value matrix, which is actually diagonal (in this case these two rows are zero, so we have three singular values); and V^T, which has dimensions N×N, three by three. This is the output of the SVD algorithm. So how do we actually reduce the dimension?

(Refer Slide Time: 05:57)

Again, the idea is that, as we saw visually in that data set, there is one direction along which the variance is maximum. However, when the number of dimensions increases this is hard to visualize, so the selection is done using this particular matrix, the singular value matrix. If you look at it, these all-zero rows are already irrelevant because they do not correspond to any useful singular value. The least singular value corresponds to this entry, so we can drop the row and column corresponding to it, which means this particular column here in U and this row here in the V^T matrix. These we can drop because they do not correspond to any useful direction.

That way you get a truncated set of matrices with which we can project to two dimensions. We have reduced the number of features from 3 to 2 by throwing out the direction corresponding to the least singular value. This is the truncated SVD that we use for dimensionality reduction. So what we do is throw out these rows and columns and then actually carry out the matrix multiplication to get the reduced form of your data matrix.

Remember that each column corresponds to a feature; that is how the data points should be arranged. One more requirement before you do PCA: the data all has to be zero centered, so each column has to be mean subtracted so that its mean is zero; this is one of the requirements for doing PCA. This is one of the most often used techniques for dimensionality reduction; it serves as a pre-processing step for machine learning algorithms like classification and regression, and even if you want to do deep learning with images you can do it, except that you then have to rasterize the images into columns and rows depending on how you arrange the data.
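
As a minimal sketch of this recipe in Python (numpy only; the toy matrix and the choice of keeping K = 2 components are purely illustrative):

    import numpy as np

    # Toy data: 5 samples (rows) x 3 features (columns)
    X = np.array([[2.5, 2.4, 0.5],
                  [0.5, 0.7, 1.9],
                  [2.2, 2.9, 0.8],
                  [1.9, 2.2, 0.9],
                  [3.1, 3.0, 0.4]])

    Xc = X - X.mean(axis=0)      # zero-center each feature column (required for PCA)

    # Economy SVD: Xc = U @ diag(s) @ Vt, with s sorted in decreasing order
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

    K = 2                        # keep the K directions with largest singular values
    Y = Xc @ Vt[:K].T            # project the data onto the top-K principal directions
    print(Y.shape)               # (5, 2): dimensionality reduced from 3 to 2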

So this is like the workhorse technique, and it has been shown to improve performance in many algorithms, because by re-projecting your data onto new axes it removes some unwanted features and keeps only those directions which are relevant by themselves, the ones carrying maximum variance.

Some key points I would like to reiterate: let us say we have this data set here, where the red and black data points represent two-dimensional data with features X1 and X2. The idea is to find new axes to represent this data, with the condition that the new axes still form an orthogonal set; that is enforced by the algorithm. We find the axis corresponding to the most variation in the data, then find another axis perpendicular to it and look at the variation along that direction, and so on and so forth.

So the principal axes being orthogonal is very important, and the axis we want to keep is the one with maximum variance along its direction. These are the key points to remember. This is all accomplished by doing the SVD, and when there are more than two or three dimensions the best way to figure out which directions carry the most variance is to look at the singular value matrix; you keep the first K most significant terms, with K less than N, where N is the original dimensionality of your data. Here we conclude principal component analysis; there are lots of resources on the web about the actual algorithm itself, and we will post some on the discussion forum as soon as it is open.

Machine Learning for Engineering and Science Applications
Professor Dr. Ganapathy Krishnamurthy
Department of Engineering Design
Indian Institute of Technology, Madras
Support Vector Machines
Hello and welcome back. In this video we will look at support vector machines, an introduction to support vector machines; the slides are provided by Intel Software.

(Refer Slide Time: 0:25)

Let's first look at the relationship to logistic regression. Consider this example, where we are trying to determine whether a patient survived or was lost based on the number of cancerous nodules from the patient. If we use logistic regression, the idea is: if the output of the logistic function is greater than 0.5 we classify the input as class 1, and if the output is less than 0.5 we classify it as class 0.

Now recall the logistic function: when the argument to the logistic function is greater than zero we get 0.5 or above, and when the argument is less than zero we get a value less than 0.5. If the argument is exactly zero, the output is 1/(1+1) = 0.5; this is the point at which we draw the threshold. So for all arguments greater than zero the logistic function outputs more than 0.5, and for all arguments less than zero it outputs less than 0.5, and accordingly we classify the input as class 1 or class 0.

So the idea behind fitting the logistic sigmoid is that for all class 1's the argument is much greater than zero and for all class 0's the argument is much less than zero. Then the function evaluates to a value well above 0.5 (or well below it), and we can say with confidence that the input is class 1 (or class 0).
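
For reference, the logistic (sigmoid) function in question is

    \sigma(z) = \frac{1}{1 + e^{-z}}, \qquad \sigma(0) = \frac{1}{1 + 1} = 0.5,

so with z = W^T X, the sign of the argument determines on which side of 0.5 the output falls.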

(Refer Slide Time: 2:15)

Another way of looking at this problem, which is how SVM looks at it, is in terms of decision boundaries. Say we draw a decision boundary here; since we have only one feature, we can think of it as a threshold, the decision boundary given by this particular line. Then we have 3 misclassifications: the blue points here are misclassified. So we try another decision boundary over here; again we have 2 misclassifications, corresponding to those 2 points right there. However, if we draw the line at some point in between this one blue and one red point, then we have no misclassifications. But there are multiple choices for this line; we have an entire range, and it is very difficult to determine where to draw it.

So the idea behind SVM is to figure out where to draw this line. In 1D it is more like a threshold, in 2D we have a line, and in multiple dimensions it corresponds to a separating hyperplane. We will look at what the two dotted lines mean later on, but this width is typically known as the margin; that is why SVM is referred to as a maximum margin classifier: it works by maximizing that margin. We will see that later on.

The point behind support vector machines, then, is to figure out this boundary, this separating plane, line or hyperplane between the two classes. One of the criteria here is that the 2 classes should not have any overlap; they should be linearly separable, and that is the problem we will be considering. We will be looking at the linear SVM; so, for this first introductory topic, we will look at classes that can be linearly separated.

(Refer Slide Time: 4:07)

Coming back to our 1D example, intuitively you can see that this is the ideal separating point, because then we have some leeway here and here too. We also see that the two dotted lines correspond to points of either class (there is one in the red class and one in the blue class), and these lines pass through the points which are closest to the boundary.

You can see that one dotted line passes very near the red point and the other passes very near the blue point, and these two points are the closest to the optimal boundary we have drawn here. These points are referred to as support vectors. They are called vectors because we can think of them as vectors in N dimensions, that is all: each is a point in N dimensions, if your data set is N-dimensional. So that is the objective of a support vector machine: to find the optimal boundary with respect to the support vectors, which are the closest points of either class to the separating boundary or separating plane.

(Refer Slide Time: 5:33)

We will now consider a slightly more complicated, 2D case, where we can illustrate things much better. We have two features, the number of malignant cancerous nodes and the age of the patient, and let's say we are trying to predict survival: there are 2 outcomes, the patient is lost or the patient survives, denoted by the red and the blue dots.

In the case of SVMs, we typically shift from 0/1 labels to +1/−1. In logistic regression we saw class 0 or 1; in SVM the class labels are typically −1 and +1. So how do we draw this line? As you saw before for the 1D case, we can draw many such separating lines, and as we go through them we will see that each has misclassifications associated with it. The way to interpret any such line: one side of the line is class +1, the other side is class −1. That is how we determine the classes.

This one is a fairly good separating boundary, but you see it is also very sensitive: suppose we had a red point here; it is dubious, in the sense that it is very difficult to figure out which class it belongs to. In a similar way, this other line is also not the best, because again it is very sensitive to points lying very near the boundary.

(Refer Slide Time: 6:56)

Ideally, then, this is an excellent boundary: it separates the two classes very clearly, and we have this margin. Once again, the margin we talk about is the distance between the classes, as determined by the points which are closest to the separating line (or separating plane). This is the optimal separating line or plane, and the two dotted lines are determined by the points that are closest to it.

So, once again, to reiterate: the point of SVM is to determine this line or plane (in more than 2 dimensions it is a hyperplane), and the idea is to determine the separating hyperplane based on the support vectors. The support vectors are points of either class (the +1 and −1 classes) that are closest, in terms of geometric distance, to the separating hyperplane. What we actually optimize for in the SVM algorithm is making this distance maximum, and this distance is referred to as the margin. Maximizing the margin determines the optimal separating hyperplane; that is the idea behind the algorithm for support vector machines.

(Refer Slide Time: 8:30)

So how do we go about doing this? The idea is similar to what we did for logistic regression: in terms of the model it is similar, but we will explicitly state the bias term. You can think of it as the equation of a line, or a plane in 3D, or a hyperplane in more than 3 dimensions: we have W^T X, and then the bias term b, giving W^T X + b = 0. That is the equation of this line, where X is your feature vector, W is your weight or parameter vector, and b is your bias term.

Now what we do is fix the distance between this line and the support vectors. Based on the support vectors, we figure out the equations of these two lines: we call them W^T X + b = 1 and W^T X + b = −1, and we put them at unit distance from the separating hyperplane. The idea, as we discussed, is that we want to maximize the margin.

Maximizing the margin then means figuring out the distance from the optimal plane to, let's say, the closest support vectors: the distance from a point on one supporting line to the separating line, and from a point on the other supporting line to it. The sum of these two distances is called the margin, and that is what we want to maximize.

So how do you compute that distance? Please look up, perhaps from your high school coordinate geometry, how to determine the distance of a point from a line. If you consider a point on one of these lines and compute its distance to the other, you will see that the total distance comes out to be 2/|W|. This is the margin: you take a point on this line and calculate its distance to that line, and then take a point on the other line and calculate its distance back.

We know the equations of those lines, and you can take any arbitrary point and carry out the calculation either way round; you should be able to obtain this distance, 2/|W|. So this is the margin, and we want to maximize it. Remember that when we maximize the margin with respect to W, the numerator 2 is just a constant, so maximizing 2/|W| amounts to minimizing |W|.

So what have we done here? We have drawn the two supporting planes so that they pass through the points that are closest to the optimal separating line or plane; those are the points (maybe this one) we used to draw the dotted lines. This means there should be no points in between these two lines: all class +1 points should be on one side, and all class −1 points on the other. That constraint is written as Y(W^T X + b) ≥ 1, where Y is your ground truth label. The idea is: when Y = +1, W^T X + b should evaluate to a number greater than or equal to 1 (hopefully much greater), so the product Y(W^T X + b) is a positive number of at least 1. Similarly, when Y = −1, W^T X + b should evaluate to a number less than or equal to −1, and again the product Y(W^T X + b) comes out to at least 1. That is the point of having this constraint. So we maximize the margin subject to this constraint, or equivalently minimize |W| subject to it; this is the loss function for support vector machines.
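
Putting the pieces together, the hard-margin optimization problem being described is:

    \max_{W,\,b} \frac{2}{\lVert W \rVert}
    \quad \Longleftrightarrow \quad
    \min_{W,\,b} \frac{1}{2} \lVert W \rVert^2
    \quad \text{subject to } Y_i \left( W^T X_i + b \right) \ge 1, \; i = 1, \dots, m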

So again, to recap: we are considering a classification problem where the classes are linearly separable; that is an important constraint. The non-linear case is handled by something called the kernel trick; if time permits we will go there, but otherwise we will stop at the linear SVM.

So we are only considering linearly separable classes, which means we should be able to draw a line (a threshold in 1D, or a separating hyperplane if you are looking at more than 3 dimensions) between the classes. And the thing to consider is: given this line (we are just considering 2D for our conversation here), the geometric distance from the line to the nearest points of either class should be maximized. That is the objective of the SVM.

When we do that, we not only find this hyperplane, we also get the support planes; I call them support planes because the points closest to the optimal boundary, through which they pass, are called the support vectors.

So if we have this setup, it means that everything on the far side of each support line belongs to one particular class, and there is no point in between them: no training data points fall inside the margin. This is typically used for binary classification, plus or minus 1; if you have multiple classes you do one against the rest, but typically SVMs are used for binary classification. The way to formulate this, again, as I said, is to maximize the margin; to do that you define the two supporting planes which pass through the support vectors by the equations W^T X + b = 1 and W^T X + b = −1. Basically you are putting them at a kind of unit distance, if you like, and you find the distance between these two lines in terms of W, which turns out to be nothing but 2/|W| (this is coordinate geometry; you should try it out). And of course maximizing that is the same as minimizing |W|, subject to the constraint that there are no points in between the supporting planes, and that constraint is expressed by the inequality Y(W^T X + b) ≥ 1.

So this is the objective function for SVM; this is called a quadratic programming problem, and it is solved using Lagrange multiplier techniques. We do not have time to go into that here, because we would have to go through the derivations to see how, in the end, the form of the cost function lets us handle even non-linear decision boundaries. If time permits we will look at it; otherwise we will defer it to a later class, or I will just post some resources to read from.
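
As a usage sketch (scikit-learn's SVC with a linear kernel on a toy 2-feature problem; the data points and the large value of C used to approximate the hard margin are illustrative assumptions):

    import numpy as np
    from sklearn.svm import SVC

    # Toy 2-D, linearly separable data (made up purely for illustration)
    X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0],    # class +1
                  [6.0, 5.0], [7.0, 8.0], [8.0, 6.0]])   # class -1
    y = np.array([1, 1, 1, -1, -1, -1])

    clf = SVC(kernel='linear', C=1e6)   # very large C approximates the hard margin
    clf.fit(X, y)

    print(clf.coef_, clf.intercept_)    # W and b of the hyperplane W^T x + b = 0
    print(clf.support_vectors_)         # the training points lying on the margin
    print(clf.predict([[4.0, 4.0]]))    # classify a new point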

Machine Learning for Engineering and Science Applications
Professor Dr. Ganapathy Krishnamurthy
Department of Engineering Design
Indian Institute of Technology, Madras
MLE, MAP and Bayesian regression

Hello and welcome back. In this video we will look at MAP, maximum a posteriori estimation, and Bayesian regression. The material is inspired by the PRML book by Christopher Bishop, and many images are also taken from that book.

(Refer Slide Time: 0:31)

We will consider this in the context of linear regression. We have the model typically used in linear regression, where ϕ(x) corresponds to some functions of x, polynomial or otherwise, typically x, x², x³ and so on. We are given training data pairs (x_i, y_i), with i ranging from 1 to m.

We saw earlier that if the points (x_i, y_i) are independent and identically distributed, then the probability of observing the data set is nothing but the product of the probabilities of the individual data points. This is the likelihood function, the likelihood of the data given the model. That is how we modeled it earlier, and when we took the negative log likelihood, we ended up with the mean square loss function, provided we model each of these probabilities as a Gaussian.

(Refer Slide Time: 1:43)

Now we use Bayes rule: basically, we want to calculate p(w | y, x) given the data, the (y, x) training pairs. Using Bayes rule we can write it in this form, where this term is our likelihood and this one is the prior. The denominator we can call the probability of observing the data, which is a constant.

If we expand out the likelihood, we once again end up with a product of individual probabilities, with a normalization factor which again evaluates to a constant, because it depends on the training data set alone and not on w. It has only a scaling effect, so we can always absorb it as a constant.

(Refer Slide Time: 2:44)

So now we think about what to do with the prior.

(Refer Slide Time: 2:50)

If we go back again: we saw this term; I said it was the prior but did not exactly mention what it was. It came out of applying Bayes rule to get the posterior probability. Just to mention again: this term is the posterior probability. The prior seems to just come out of the application of Bayes rule, so we have to see what it means.

(Refer Slide Time: 3:12)

If you recall the polynomial curve fitting example we had: as the degree of the polynomial increased there was overfitting, which in turn led to very large values of w. Remember, very large values of w. It was desirable at that time that the values stay low, because w was becoming very large to compensate for individual data points; it was overfitting.

So how can we enforce that? We can enforce it by saying that the prior should be a Gaussian, with zero mean and a standard deviation which we can either estimate or set as a parameter by cross-validation. In this case, we can assume that the covariance matrix of the Gaussian is diagonal. So how does this help avoid large values of w?

If we impose the prior that w is drawn from a zero-mean Gaussian distribution, then we are looking at a function of the form

A e^{−‖w‖²/α²},

with some normalizing factor A (which I am not writing out) and a variance parameter α². If you plot this PDF against w (again, w can be positive or negative), you will see something like a bell curve: for very large values of w, the probability of drawing that w drops off exponentially.

And we can control this width as a hyperparameter; we can make the prior very sharp, so that we only get very small values of w. So by imposing the prior, we make sure that there is no overfitting; that is the advantage of using a prior.

(Refer Slide Time: 5:15)

To recap: we want to estimate what we call the posterior distribution. When we optimize with gradient descent, we directly estimate w; from a probabilistic point of view, we want to estimate w given the data set (y, x). If we use Bayes rule and rewrite it in this form, it turns out to be a product of likelihood and prior.

We saw that if the likelihood function is modeled as a Gaussian, taking the negative log likelihood leads to the least-squares loss function, and that is what causes overfitting in many of our problems. By looking at the posterior instead, we have this additional multiplicative term, the prior.

(Refer Slide Time: 5:58)

What we do is impose a prior on w; it can be any probability distribution, and one of the most commonly used priors is the Gaussian distribution. If you plot the Gaussian distribution, you will see that for very large values of w the probability of getting that w goes to 0.

(Refer Slide Time: 6:13)

So now we explicitly model p(w) as N(0, α²I), a normal (Gaussian) distribution whose covariance matrix is the diagonal matrix α²I; again, α can be a learnt parameter or a hyperparameter. If we then write out the posterior distribution of w given the training data, it can be written in this form; once again I point out that the denominator is a constant. And recall that we also model the likelihood as a Gaussian, where the estimate of the mean is given by the model w^T x (to be more general I should write w^T ϕ(x); you can put that in there), and we assume a covariance σ²I, where all the variances are the same.

(Refer Slide Time: 7:06)

If we take the negative log of the posterior probability, you see that it simplifies easily, because it is the log of two exponentials: the exponents come down, and we arrive at this loss function, which is your least-squares loss plus an L2 regularization term. Recall that when we did the polynomial curve fitting, adding the L2 regularizer prevented the weights from blowing up and also prevented overfitting.

This is the same as maximizing the posterior probability, the maximum a posteriori probability or MAP. You can actually solve this problem using gradient descent, and you get a value of w which we call the MAP estimate; that is one way of solving it.
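
Written out (a standard result; additive constants are dropped, and with the prior written as above the regularization weight collects the ratio of the two variances, λ = 2σ²/α²):

    -\log p(w \mid y, x) \;\propto\; \sum_{i=1}^{m} \left( y_i - w^T \phi(x_i) \right)^2 + \lambda \lVert w \rVert^2

which is exactly ridge (L2-regularized) least squares.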

(Refer Slide Time: 7:54)

However, if you look at this log probability, it can be rewritten in another form. What is this form? If you recall, we have seen expressions of the type (w − w̃)^T Σ⁻¹ (w − w̃): this is the exponent of a Gaussian. So it turns out that if we use a Gaussian model for the likelihood and a Gaussian prior, the posterior is also a Gaussian.

Using a trick called completing the square, you can get it into this form, from which you can directly read off w̃ (which is the same as w_MAP) and the corresponding covariance matrix. If you look at it, the covariance estimate contains contributions both from the data and from the constraint we imposed in the form of the prior (the α²), and it goes into determining w̃, the w_MAP estimate. So w̃ is the same as the w_MAP estimate; it is the mean of the posterior.

But once again, note that this is still a point estimate of w: you get one value of w, and you use that w for every new test data point to get a point estimate of your output,

ŷ = w^T ϕ(x_test).

That is typically what you get, and what you do all the time.

Now, that is nice, but we will go one step further and do what is called Bayesian regression, where we compute a predictive distribution for y. What we want is an error bar on the estimate ŷ. ŷ is what you predict, but it is just one number that is thrown out, whether you use MAP or MLE: just one number comes out. In many cases it would be interesting, and useful, to have an error bar on that estimate, and that is where full Bayesian integration comes in.

(Refer Slide Time: 10:14)

But first we will look at the MAP estimate with a slide example. Consider the problem where your model is w₀ + w₁x, and you are given training data points (x_i, y_i). In this toy example, w₀ and w₁ were fixed; for various values of x_i the corresponding y_i were generated from w₀ + w₁x_i, and then some noise was added to each y_i. That gives the data points (x_i, y_i).

So then we start off with our prior. In every row of the figure, the leftmost column is the likelihood, the middle column is the prior (which also doubles up as the posterior after updating), and the right one is your predictions, if you can think of it that way; that is the data space. First, the prior: as we saw, we have a Gaussian prior.

We have a Gaussian prior, and if you plot it out, this is what it looks like: w = 0 corresponds to the maximum value, and it drops off exponentially as you move away from zero along either axis. This axis is w₀ and that axis is w₁; we are plotting the exponential distribution as a function f(w₀, w₁). That is our prior distribution.

From that prior you draw 6 samples, and you see the red lines here: the red lines correspond to the 6 samples, that is, 6 pairs of (w₀, w₁), numbered 1 to 6. For each of them you can evaluate w₀ + w₁x, for your training data x or for arbitrary values of x.

If you do that, then for every x there is a y, so you can plot a line: the equation of the line is y = w₀ + w₁x. So if you draw 6 pairs of w₀ and w₁, you get 6 lines in the xy plane (this is x and this is y); that is the data space. Now, we actually have training data; let us say we observe one data point at a time.

In this case, let me try to highlight it with something visible: this white cross is an observed data point (I do not think it is very clear, but that is it). So you have one data point, one observed (x, y) pair.

So then what can you do? You can calculate the likelihood. How? Remember, it is an exponential; ignoring all the constants, something like exp(−(y − w^T x)²/(2σ²)). So, given the observed x and y, you can calculate this likelihood for various values of w₀ and w₁.

The dark red bands correspond to regions of maximum likelihood for this one data point that we have observed. Now you take this likelihood and multiply it with the prior: likelihood times prior gives you the posterior distribution, which is this. Now you can draw 6 more samples of (w₀, w₁) from this posterior distribution.

You draw 6 pairs and plot 6 different lines here in the xy plane. Then you go back and observe one more data point; you can recalculate, computing the likelihood for that data point, which gives a slightly different likelihood surface. The previous posterior is now the new prior: this prior times your likelihood gives your new posterior distribution, from which once again you can draw 6 pairs of (w₀, w₁) and plot the lines right here.

You can keep doing that, and after the 20th data point you will see that the posterior distribution has become very sharply peaked: it is close to zero everywhere except for a maximum at one point, and if you once again sample 6 pairs from this distribution, you get these red lines here, which, if you think about it, actually fit all the data points.

So, by sequentially considering data points (x, y), each time evaluating your posterior distribution and then using it as the prior the next time you observe a new data point, you converge to the appropriate posterior distribution. This is how you arrive at the MAP estimate, if you think about it probabilistically. Another way of looking at it: now that you have this distribution, you can sample from it.

Now that you have this distribution over w, you can keep sampling from it, and for every x you can estimate a y many, many times. That kind of gives you an error bar on your estimate. But those error bars will tend to be very similar across the board, because w is completely determined by your training data.

The new x for which you are predicting y has no impact on the error bar; you will get a similar error bar pretty much everywhere, since it is the training data set that determines your error bar. And if you think about it, this is still a point estimate. How do you get the point estimate? Now that you have a very sharply peaked distribution over w₀ and w₁, you can take the arg max: which (w₀, w₁) pair corresponds to the maximum probability? You search that space and determine the pair (w₀, w₁); that is still a point estimate.

(Refer Slide Time: 16:35)

What we want to do next, since we still only have point estimates, is what we said before: the predictive distribution, meaning a distribution on our prediction. That is, crudely speaking, the probability of y_test given x_test and the training data (which I am going to call D), with the parameters w marginalized out; that is what we want.

We want an error bar on our prediction: if you think of it in a very simple sense, we want an error bar on the prediction directly. The idea behind Bayesian regression is to determine the predictive distribution itself. So how do we go about doing that?

(Refer Slide Time: 17:23)

If you pay attention to this, what do we actually want? We want this distribution: for a new x_test, what is the probability of y_test, the output for the new data point, given that data point and the training data? We can write this using two rules of probability. The sum rule is p(x) = ∑_y p(x, y): we introduce the random variable w, the parameter of our model, and marginalize over it; for a continuous w the summation is just replaced by an integral. The second rule is the product rule, p(X, Y) = p(Y | X) p(X). In this case we apply it to p(y_test, w) to decompose it into two factors.

That gives p(y_test, w | x_test, D) = p(y_test | w, x_test) p(w | D); this is the short derivation of how to get to the predictive distribution. Why do we want it in this form? Notice that we have gone one step further and written p(y_test | x_test, D, w) as just p(y_test | x_test, w): we have left out the training data. The reason is that it does not really matter, because w captures all the information in the training data. We can still estimate w from the training data, so leaving out the explicit dependency on (x, y) is not a problem; it is already taken care of. And the second factor we know: it is the posterior distribution.

This we know; we have modelled it. We can model the posterior as a Gaussian, and this other factor, which is essentially your likelihood function, as another Gaussian. Once we have these, what does this integral do? Think of it this way: if we have some quantity f(x) and we want to estimate its mean with respect to a probability distribution, this is exactly what we would compute.

So what we are estimating is the mean of the quantity p(y_test | x_test, w), averaged with respect to the probability distribution of w conditioned on your training data. That is exactly what we are looking for: p(y_test | x_test, D) = ∫ p(y_test | x_test, w) p(w | D) dw, where p(w | D) is nothing but your posterior distribution.

So you can think of it as a way of calculating an average over all possible realizations of w given your training data. What can we do with it? We can take the arg max over y_test and figure out the y_test for which the probability is maximum; that is an estimate of y. We can also calculate the variance, because, remember, we can calculate moments.

We can calculate the mean, and then we can calculate the second moment, ∫ f(x)² p(x) dx, remember; if it is zero-centered, this is the variance. So for every new test data point we can not only calculate the mean (using the arg max), we can also calculate the variance, assuming of course that both of these distributions are Gaussian. It turns out that the integral you see here is, in general, intractable: you cannot solve it for arbitrary distributions.

But if you assume that these are Gaussian, then the integral can actually be done, and it turns out to be another Gaussian. We will not look at exactly what the resulting forms are, because it is a little bit confusing, but the general idea is: if you estimate this distribution, then using the arg max you can get the mean (assuming it is Gaussian), and you can also calculate the variance, for every new test data point.

(Refer Slide Time: 21:36)

So if we look at the output of Bayesian regression: the green is the true curve, to which noise was added, and these blue circles were sampled; we have seen this before. The red is the mean estimate for y, and the spread is available at every data point: the variance of the mean estimate for every one of your test points. So the prediction comes with this error band, an error bar at every point of your prediction. You can see that the band neatly encapsulates pretty much the entire green curve. That is the advantage of doing Bayesian regression.
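
A minimal numpy sketch of this conjugate-Gaussian case (the prior scale α, the noise scale σ and the toy 1-D data are illustrative assumptions, not values from the lecture):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(-1, 1, 20)
    y = 0.5 + 1.5 * x + rng.normal(0, 0.2, 20)      # noisy line (toy data)

    Phi = np.column_stack([np.ones_like(x), x])     # phi(x) = [1, x]
    alpha2, sigma2 = 1.0, 0.2 ** 2                  # prior and noise variances (assumed)

    # Posterior over w is Gaussian: covariance S, mean m (standard conjugate result)
    S = np.linalg.inv(Phi.T @ Phi / sigma2 + np.eye(2) / alpha2)
    m = S @ Phi.T @ y / sigma2                      # this mean is also the MAP estimate

    # Predictive distribution at a new point: a mean plus an error bar
    phi_t = np.array([1.0, 0.3])                    # phi(x_test) for x_test = 0.3
    y_mean = phi_t @ m
    y_var = sigma2 + phi_t @ S @ phi_t              # noise + parameter uncertainty
    print(y_mean, np.sqrt(y_var))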

(Refer Slide Time: 22:21)

The biggest problem in doing Bayesian regression is that, except when p(y | x_test, w) (this factor is very similar to the likelihood function; we can model it as a Gaussian) and the posterior are Gaussians or some other very specific distributions, this integral is very hard to do. In that case it is done with what are called Monte Carlo techniques.

These are numerically intensive, but the advantage is that once you estimate this distribution you can take the arg max, you can compute the mean of your estimate, and you can also calculate the variance of your estimate. Also note that, unlike with the MAP or MLE estimates, the probability of y_test depends on your current test data point as well; that is the interesting thing.

So the prediction is influenced by the test data point. If you actually write this out for Gaussian models, with both the posterior and this distribution being Gaussian, you can see that influence directly in the terms; it is in the textbook, for which I will give you the reference, so you can look that up.

So this was a brief look at Bayesian regression. As far as Bayesian regression is concerned, the idea is to get a predictive distribution for your output: you are computing ŷ, and we want an error bar on that ŷ. The MAP estimate is basically trying to maximize the posterior distribution, and the advantage of doing that is that you accomplish regularization.

So by imposing a Gaussian prior you got L2 regularization, you can impose some for instance a
exponential prior you can get L1 regularization wherein most of the coefficients will go to 0 but
the probability of getting coefficient is very small. So that way that’s possible, okay. So these
techniques are applied widely in different scenarios, this is just in the context of linear regression
you have seen it.
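To spell out the prior-to-penalty connection in symbols (a standard derivation; the constant λ is set by the prior and noise variances, which are assumptions here): with a Gaussian likelihood, minimizing the negative log posterior gives

-\log p(w \mid D) \;\propto\; \sum_i \big(y_i - w^\top x_i\big)^2 + \lambda \lVert w \rVert_2^2 \quad \text{(Gaussian prior, L2)},

while a Laplacian prior replaces the last term with \lambda \lVert w \rVert_1 (L1).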

The treatment in Bishop is rather dense, but it is actually one of the best treatments available, so I urge you to go through it; I will also provide you with the references for the textbook. Bishop's textbook, by the way, is available online for free, so you can look through that. Thank you.

Machine Learning for Engineering and Science Applications
Professor Dr. Ganapathy Krishnamurthi
Department of Engineering Design
Indian Institute of Technology, Madras
Introduction to Generative Model
(Refer Slide Time: 0:15)

Hello and welcome back. In the next series of lectures we will look at generative models, especially deep generative models, and in this video we will just take a look at what they mean and what they are used for.

(Refer Slide Time: 0:28)

So here is a brief outline of this lecture: we have some background to cover, we will see what a generative model is and why we need one, and then the types of generative models we will be covering as part of this course.

(Refer Slide Time: 0:42)

So far, you have looked at problems wherein we have pairs (X, Y) available, where X is the data (it may be multidimensional data, images, whatever form of data you have with you) and Y is the label associated with it. A typical task is a classification task: yes or no, or, like we saw with the ImageNet data, you have 1000 classes. So you have an input X and the corresponding label is given; these are labelled datasets, and what we use deep neural networks for is to learn the mapping from X to Y.

So X is the input, and we have a deep neural network which processes the input and outputs, say, a probability score, which we threshold to make into a class label. This is in the context of supervised learning; this is what you have seen. Another way of looking at it is that the deep neural network defines a classification boundary, as shown here in this graph, wherein we have two features, feature one and feature two; these could be some arbitrary numbers after scaling.

The red and the blue dots are the data points corresponding to the two different classes, and what the neural network does is determine this boundary, if you look at it that way, for which we have access to both the data as well as the labels.

(Refer Slide Time: 2:18)

We have also seen unsupervised learning, wherein there is no associated label: Y is not known, Y is not given, and what we try to do in this situation is to learn some underlying structure in the data. Clustering techniques are what are typically used in that context; for instance, K-means clustering, wherein you specify the number of clusters that you hope to find in the data and you end up with a labelling accordingly.

In this case we again have two features, and if you think that there are two clusters present in this data, the K-means algorithm determines the centres of those clusters and is able to label each of the data points, based on its distance from the cluster centres, as belonging to either one class or the other. So here we are just trying to determine some underlying structure in the data; we do not have access to the labels, we just have the raw data available. In this context we have seen clustering; feature learning and density estimation are other problems that fall under unsupervised learning. A minimal clustering sketch is given below.
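Here is a minimal sketch of the K-means idea just described, using scikit-learn; the synthetic two-blob data stands in for the red and blue point clouds on the slide and is purely illustrative.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two synthetic blobs standing in for the two classes on the slide
X = np.vstack([rng.normal(loc=(0, 0), scale=0.5, size=(100, 2)),
               rng.normal(loc=(3, 3), scale=0.5, size=(100, 2))])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_)   # the two estimated cluster centres
print(kmeans.labels_[:5])        # each point labelled by its nearest centre

Note that no label Y appears anywhere: the algorithm only uses the raw features.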

(Refer Slide Time: 3:52)

Part of unsupervised learning is the density estimation problem. The idea is to determine a probability density model for the data, from which you can draw new samples; you determine that model density from the given data. So you are given data X, let us say N data points, x_1 to x_N; from these you can get an empirical density, that is, p_data(x).

We do not know the true p_data(x); for that you would need an infinite number of points, all possible X's that exist, which is not possible. With the data available, you can get an empirical density estimate, but what we want is to determine a model from which we can sample new data points. This is basically generating samples, hence the terminology: generative models.

So given these data points (again, most of the time we would not have access to the labels; we just have the raw data), what we want to do is figure out a model for the underlying probability distribution, so that we can draw new samples from it which are similar to our training data. We assume that our training data is representative of the problem that we are trying to solve. A small sketch of this fit-then-sample idea follows.
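As a minimal illustration of the fit-then-sample idea, here is a sketch that fits a Gaussian mixture as p_model(x) and draws new points from it; the choice of a Gaussian mixture and the synthetic data are assumptions for illustration, not part of the lecture.

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=1.5, size=(500, 1))  # stand-in training data

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)  # fit p_model(x)
new_samples, _ = gmm.sample(10)  # draw brand-new points from p_model
print(new_samples.ravel())

Deep generative models replace the simple mixture with a neural network, but the goal is the same: new samples that look like the training data.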

(Refer Slide Time: 5:57)

So why is this useful, and what do we do with it? Before we proceed, there is another way to look at it, which is to say that generative models learn the joint probability distribution p(X, Y), whereas a classifier, for instance a deep neural network that classifies data into one of K classes, outputs scores that can be interpreted as the posterior probability of the class given the data, that is, p(Y | X).

That is not a generative model. What we actually want is the former; of course, the two are related through Bayes rule, so if you have a good model for the prior p(X), we can actually relate them. Models of p(Y | X) are called discriminative models in the context of classification: what is the probability of the label given the data, which is what a typical machine learning algorithm outputs.

That applies to most of the deep neural networks that you use for classification, but here we seek something different: we actually want to model the underlying data distribution, that is, p(X), which is typically what we want to get.

(Refer Slide Time: 7:38)

So why do we need generative models? Let us now look at what kind of applications they can possibly serve. The most common application is that they can generate nice new pictures. We can also use them for image-to-image translation, as shown in these examples: for instance, given labels or outlines, can you generate new data? For instance, can you transform black-and-white images to color?

Given the aerial map of a city, it can give you this kind of output, as you see in Google Maps; or the difference between night and day, here a scene in daytime and the same scene at night. Once again, if you train the models so that you give them an outline, they can give you an output; what you see here is a new design for a handbag, if you can think of it that way. So there is aerial-to-map, labels to street scenes, labels to facades, black-and-white to color; these are some neat, cool applications that you can think of for generative models.

(Refer Slide Time: 8:50)

But there are deeper applications: for instance, generating speech from text. If you have, say, a program like an automated assistant that responds to your phone calls, that is the kind of capability it would require: it has some text stored in digital format and must now convert it to a speech signal in order to respond to a caller. So for raw audio you would like to have a generative model, if you think about it.

(Refer Slide Time: 9:23)

Generating sequences of text: here is one example. A simple application you can think of is the auto-complete on your cell phone, or the mail programs that you are using might have it: if you type a few words, it should be able to complete the word or maybe even the sentence. That is also a generative model. Those are some of the more obvious practical applications of generative models.

(Refer Slide Time: 9:51)

Another application would be super-resolution: you are given low-resolution data, and ideally you want the model to generate a high-resolution version of the same. You can also think of image inpainting, wherein if there are some holes or damaged areas in the image, you would like to fill them in; if you have a generative model, it should be able to figure out what the missing data is, so you can think of it as a data imputation problem also. These are some of the very high-impact, first-order applications for generative adversarial networks.

(Refer Slide Time: 10:41)

In the context of medical imaging this also has very powerful uses, some immediate ones of which have been reported in the literature. For medical image applications you need patient data, which has to be obtained with patient consent, and a lot of work has to be done to make sure that you are sampling the correct patient population and so on.

So it is hard to obtain labelled patient data: it is hard just to obtain the patient data, and then of course you also have to label it, which is an even more difficult problem, before you can train a classifier. It would be nice if you could actually generate data, that is, generate images of anatomy. Say you want to analyse brain tumours or liver tumours; it would be nice to have a generative algorithm for generating images of the liver or the brain, even of regular anatomy, so that we can actually do at least anatomical-level segmentation.

That would be a very immediate application for generative models. Another neat application would be image-to-image translation; this is a little bit complicated, so try to follow the argument. In many situations you will have a lot of data for a particular modality, let us say CT images of the liver; scans would be easily available, so a large amount of data, even if unlabelled, is easily available.

You can then train a model for whatever diagnostic task you wish to complete; let us say you have used CT images of the liver to segment the liver, just to mark out its anatomy. Now a new MRI scanner comes in and you start to take MR images of the liver, which are generally not so easily available. In order to train a supervised classifier for liver segmentation from MR images, you need a lot more data, which might not be available.

A generative model in this context can be used to convert or translate the MR image to a CT image, use the CT image segmentation model, and then transfer the segmentation mask onto the MR image. That is a possibility: you have trained an expensive model with a lot of data, and when a very similar problem comes along, it would be nice to reuse that trained model on which you spent a lot of time and resources.

Of course this does not work for everything. Just for those who are not in the medical imaging field: let us say you have a network that does some diagnostic task on CT images of the head; you cannot then take MR images of the liver and use that network directly. Of course you can do fine tuning, but that is a different, transfer learning, problem; we are not directly reusing the network, and there is no image-to-image translation there. We have to use it in the correct context: if we are looking at CT images of the liver and MR images of the liver, then it is more meaningful to do this.

Of course, these are slightly more research-oriented applications, but very high impact if they are solved. Not only in this context, but in various other contexts, generative models are used in medical imaging tasks; they are very useful, especially in scenarios where you do not have too much data, where this might come in handy.

(Refer Slide Time: 14:43)

So what are the types of generative models that we will cover? We will go into the details of a couple of these techniques, not all of them, and see how they can be used to generate data. Most of the applications we will be looking at involve generating images; that is the most common application. We will not look at text or speech, just generating images. In that context there are different types of generative models. One of the more highly cited models is PixelRNN or PixelCNN, as they are called; they are based on what are called autoregressive models.

They are called explicit models because they explicitly determine the probability density of the data. They are called autoregressive because, if you think of an image, you try to generate a particular pixel based on the pixels you have looked at so far; if you raster the image row by row or column by column, you try to predict a pixel in, say, the second column by considering the pixels in the first column of the image. That is the basic principle.

So it comes under autoregressive models, and it gives you an explicit density; basically, the density is modelled using the deep neural network itself. Just to give you an idea: for instance, if you are working with some simple data, you can model that data using, let us say, a Gaussian, so your density would be something of this sort:

p(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(x-\mu)^2 / (2\sigma^2)}

Instead of this closed form, we will just have some function of x which is basically nothing but a deep neural network. In most generative models using neural networks, the idea is to model p(x) with the deep neural network itself; the network is the model. Another class of generative models which also explicitly determines the density is the latent variable models; one of the more highly cited works there is the variational autoencoder.

We will cover this in the next few videos. It is called a variational autoencoder because it uses an approximate (variational) way of determining the density function; we will see why it is called a latent variable model when we actually look at the algorithm in detail. Of course, the most studied model in the recent past is the generative adversarial network. Here we do not explicitly model the probability density; instead, we just sample from it directly: the neural network outputs the sample. That is the idea behind GANs; again, these are deep neural networks, and they have primarily been used for generating images.

In fact, I have not seen many applications otherwise; the more popular applications are for generating images, and there have been very recent successes in generating images of human faces that look very realistic, which you can look up if you do a search. We will primarily focus on VAEs and GANs: variational autoencoders and generative adversarial networks will be the two generative models that we will focus on. To summarise, generative models try to model the underlying data distribution.

The data is basically your training data; we want to figure out p(X), where X is the training data, so that we can draw samples from the probability distribution which look like your training data. That is how we should interpret generative models. Once you have that model, there are a lot of applications that become possible; I showed you some of them. They can also be used as classifiers, that is, in the context of classification; of course, if you have labelled data it is much more straightforward to train the classifier directly using a deep neural network, but it is good to know that it is a possibility using the model that you have figured out. We will continue with this in the next few lectures, where we will look at variational autoencoders and generative adversarial networks. Thank you.

Machine Learning for Engineering and Science Applications
Professor Dr. Ganapathy Krishnamurthi
Department of Engineering Design
Indian Institute of Technology, Madras
Generative Adversarial Networks (GAN)
(Refer Slide Time: 0:14)

Hello and welcome back. In this video we will look at generative adversarial networks. These are a class of generative models which do not explicitly model the data distribution, but rather provide samples from it; the sampling is performed using a deep neural network, which takes as input a random noise vector and maps it into a sample from the model distribution.

Let us say you are given training data from p_data(x) in the form of samples x_i, where i goes from 1 to N. What we want to do is determine a model, p_model(x), such that it is a good approximation of p_data, the true distribution. Once again, we do not actually have access to all possible data; we only have samples from p_data(x), and we want to determine some p_model, which is basically a model of the probability distribution of X.

We want a model that we can sample from. In this case we do not model the density explicitly, so in a sense it is not a parametric model; rather, this is accomplished using a deep neural network, which actually generates a sample from the model distribution. We will see how this is done.

(Refer Slide Time: 1:59)

The generative adversarial network framework consists of two neural networks: the generator and the discriminator. The function of the generator is to take as input a random noise vector and transform it into a sample from the model distribution. The discriminator's job is to act as a classifier, wherein it tries to determine whether its input data X came from the generator (we call these the fake samples) or from the actual training distribution (the real samples).

It is called adversarial because the generator is constantly trying to fool the discriminator into making the decision that input generated by it is from the training distribution, while the discriminator is constantly trying to learn the decision boundary. The two networks, generator and discriminator, work against each other: the generator constantly tries to generate samples that will fool the discriminator into classifying them as coming from the training data distribution, and in that process the weights of the generator learn a transformation which converts the random noise input vector into a sample from the model distribution.

(Refer Slide Time: 3:37)

Just to illustrate what we discussed: the generator G takes as input a random noise vector, denoted by z and also referred to as the latent space. This random noise vector acts as input to the generator G, which then outputs generated samples, which are hopefully similar to the training data distribution.

The discriminator D takes as input your training data, that is, the samples which are made available to you. For instance, if you are interested in generating faces, then you would have a database of faces of different people, and that would be the input to the discriminator; the framework is a general-purpose algorithm, not specific to a certain task. So the discriminator takes as input the training data, which are labelled as real, and the generated fake samples, which are labelled as fake; we can say the fakes get the label zero and the real samples the label one.

The discriminator takes as input the training data samples as well as the generated samples, and its output feeds an error function: the discriminator basically outputs the probability of a particular sample being real rather than fake, ranging from 0 to 1. This output is what provides the error signal to train the weights of the generator as well as the discriminator.

(Refer Slide Time: 5:18)

How is that done? This problem is formulated as a zero-sum game, so to speak: if we denote by J^(G) the cost function of the generator, then it is basically the negative of the cost function of the discriminator, denoted J^(D). The cost function is what is given here, also referred to as the value function; it is a function of two sets of parameters, one corresponding to the discriminator and the other corresponding to the generator.

J^{(G)} = -J^{(D)}

\min_G \max_D V(\theta^{(D)}, \theta^{(G)}) = \mathbb{E}_{x \sim p_{\mathrm{data}}}\big[\log D(x)\big] + \mathbb{E}_{z \sim p(z)}\big[\log\big(1 - D(G(z))\big)\big]

This is optimized alternately; there is an inner and an outer loop. The inner loop maximizes this value function with respect to the discriminator network parameters; the outer loop minimizes the same objective function with respect to the parameters of the generator network. Let us take a closer look at the cost function itself.

If you look at this cost function, E_{x∼p_data}[log D(x)] + E_{z∼p(z)}[log(1 − D(G(z)))], what it means is that you calculate log D(x) with respect to the training data samples, and you calculate the second term with respect to the samples generated from z. This is very similar to the binary cross entropy, assuming that there are an equal number of generated images and training data samples.

We will see how minimizing or maximizing this cost function makes sense in the context of the generative adversarial network. Here D(G(z)) is the output of the discriminator when the generated images are given as input. D(x) goes from 0 to 1; you can think of it as the probability of the particular input sample being real rather than fake. Similarly, for the generated samples it would be D(G(z)), which again goes between 0 and 1.
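As an aside, here is a minimal sketch (not the lecture's code) of the discriminator objective written as binary cross-entropy, assuming PyTorch and a discriminator D whose final layer is a sigmoid, so that D(x) is a probability in (0, 1):

import torch
import torch.nn.functional as F

def discriminator_loss(D, real_batch, fake_batch):
    # Maximizing E[log D(x)] + E[log(1 - D(G(z)))] is equivalent to
    # minimizing the BCE of D(real) against label 1 and D(fake) against 0.
    real_scores = D(real_batch)
    fake_scores = D(fake_batch)  # fake_batch = G(z), detached for D's update
    loss_real = F.binary_cross_entropy(real_scores, torch.ones_like(real_scores))
    loss_fake = F.binary_cross_entropy(fake_scores, torch.zeros_like(fake_scores))
    return loss_real + loss_fake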

(Refer Slide Time: 7:42)

Let us look at this cost function when we start training. Ideally, the output of the discriminator should be 1 whenever x comes from the training data distribution, and 0 whenever the input comes from the generator. Initially, when the discriminator is not sufficiently well trained and its weights are still random, consider a particular test case wherein you have real data and some generated data given as input; what is shown here by the black dashed line is basically the decision boundary given by the discriminator.

Here there is one misclassification of real data, right here, and one misclassification of generated data. Everything to the left of this line is class 1 and everything to the right is class 0. Now, when the discriminator misclassifies, that is, D(x) = 0 when x comes from the training data, the output should ideally be 1 but instead it says 0, and you can see that log D(x) becomes a very large negative number.

Similarly, when you take data which comes from the generator, the output should ideally be 0, but if it is misclassified then D(G(z)) is close to 1, which means that log(1 − D(G(z))) becomes a very large negative number. So this is the case where the discriminator is not performing optimally.

(Refer Slide Time: 9:30)

On the other hand, let us say it is trained very well and the samples are correctly classified, the same real data as well as fake data; you see the decision boundary right here, everything above is class 1 and everything below is class 0, so the decision boundary is correct. Then D(x) is close to 1, so log D(x) is close to 0; similarly, D(G(z)) is close to 0, so log(1 − D(G(z))) is close to 0.

So your cost function varies from a very large negative number (minus infinity) up to a number close to 0, which is its best value in this case. Maximizing this cost function therefore makes sense: maximizing it will lead to the discriminator performing optimally.

(Refer Slide Time: 10:15)

Similarly, consider minimizing the same value function with respect to the parameters of the generator network. The first term does not involve any parameters of the generator network, so we will not consider it; we will only look at this particular term here. Minimizing this term, what does it mean? Minimizing the likelihood of the discriminator classifying the fake samples as fake; basically, that is what we are trying to do with this cost function.

So this is minimizing the cost of correctly classifying G(z) as 0. However, it turns out that this cost function saturates very quickly, because initially, when the generated images are of very poor quality, the discriminator has no problem figuring out that they belong to class 0. What happens then is that the output saturates, so if you take the derivative, which is the error signal that we back-propagate through the network, the derivative becomes 0 and there is not much to back-prop.

(Refer Slide Time: 11:25)

So instead it is replaced by maximizing log D(G(z)). What this does is maximize the error of the discriminator network, in the sense that it incorrectly classifies G(z) as 1 instead of 0; ideally, what we are trying to do is force D(G(z)) to be close to 1 rather than close to 0. That is what this cost function does.

Maximizing the error of your discriminator network is what this cost function does; this is a heuristic, and it actually makes training the neural network better. We can just check this: once again, real data and fake data are being fed into the discriminator, and when the discriminator network correctly classifies the output from the generator, D(G(z)) is close to 0, and then log D(G(z)) is a very large negative number.

On the other hand, let us say the generator has progressed to a point where it is generating very realistic examples; in that case the discriminator incorrectly classifies the output from the generator as belonging to class 1, and then log D(G(z)) becomes close to 0. So once again, maximizing this cost function with respect to the parameters of the generator network leads to the generator managing to output samples which are very close to the training data distribution.
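Here is a minimal sketch of this non-saturating generator objective, under the same PyTorch assumptions as the discriminator sketch above (illustrative names, not the lecture's code):

import torch
import torch.nn.functional as F

def generator_loss(D, G, z):
    fake_scores = D(G(z))
    # Maximizing log D(G(z)) is equivalent to minimizing the BCE of
    # D(G(z)) against label 1: the generator is rewarded when the
    # discriminator says "real" for its samples.
    return F.binary_cross_entropy(fake_scores, torch.ones_like(fake_scores))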

(Refer Slide Time: 13:01)

Let us look at this learning process in terms of a 1-D distribution; this is again from the paper cited at the bottom. We have the data distribution shown here in black, the green curve is the model distribution learnt by the generator network, and the blue line is the discriminator response, which we can take as the decision boundary. Initially, when the training is not great, we will see that there are misclassifications by the discriminator.

We can judge the misclassification by seeing that all points to one side of the discriminator decision boundary are classified as real, and points on the other side, let us say this side, are classified as fake. After updating the discriminator, doing several epochs on the discriminator network, we get a decision boundary which is much better now, able to correctly classify samples, to some extent, as real versus the generator's output.

Just to explain further: we are looking at a 1-D problem here. This is x, and this below is z, the space from which we sample the random noise vector. The generator network maps z to points in x; x is the data axis, and the vertical axis here is the probability density function. That is what we are looking at.

So z is mapped to x by the generator, and it gives rise to this green curve, which is what we are trying to change. After the update, the blue dotted line, the discriminator response, is getting better at discriminating between the data distribution and the model distribution. Another epoch of updating the generator network leads to the green distribution moving closer to the dotted black distribution, that is, the model distribution correctly approximating the true data distribution, until both the discriminator and the generator have been optimally trained.

Then we get to a point wherein the green and the black curves are coincident and the discriminator is unable to say clearly which is which, in the sense that its output is always 0.5: it is not clear whether the data belongs to the training data distribution or came from the generator. So this is what the process is: you alternately train the discriminator and the generator to the point of an equilibrium, wherein the discriminator is not able to distinguish between samples coming from the training data distribution and samples coming from the distribution approximated by the generator.

(Refer Slide Time: 16:06)

This is again from the paper, just to walk you through the steps involved in the algorithm. You sample a mini-batch of noise samples; remember, the input to the generator, the random noise vector, is uniform noise or Gaussian noise. You sample m of them, where m is your mini-batch size, and you also sample a mini-batch of m training data points.

Once we sample the noise, of course, we run it through the generator to give outputs in the form of data; then you update the discriminator by ascending its stochastic gradient. We saw that we maximize the value function with respect to the discriminator when we are trying to train it with this cost function; this is done for k steps, that is the inner loop we are in. Then once again we sample a mini-batch of m vectors from the noise prior (the distribution from which you sample z is called the noise prior), and you update the weights of the generator.

In this case the original cost function is log(1 − D(G(z))); remember that we replace that by log D(G(z)), so we actually have to do gradient ascent on it, not gradient descent, to maximize this cost function. This alternates: typically you will do 1 or 2 steps of the discriminator and then go back to the generator and train its weights. In this particular construct both the discriminator and the generator are neural networks, so usually stochastic mini-batch gradient descent (or ascent) is used for updating the weights of the generator as well as the discriminator. A sketch of this alternating loop is given below.
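Here is a minimal sketch of the alternating training loop just described, assuming PyTorch, pre-built networks G and D with their optimizers, and the discriminator_loss and generator_loss sketches from earlier; all names are illustrative, not from the paper's code.

import torch

def train_gan(G, D, data_loader, opt_G, opt_D, loss_D, loss_G,
              z_dim=100, k_steps=1):
    for real_batch in data_loader:
        m = real_batch.size(0)
        # Inner loop: k ascent steps on the discriminator
        for _ in range(k_steps):
            z = torch.randn(m, z_dim)     # sample from the noise prior
            fake_batch = G(z).detach()    # freeze G during D's update
            opt_D.zero_grad()
            loss_D(D, real_batch, fake_batch).backward()
            opt_D.step()
        # One step on the generator (non-saturating loss)
        z = torch.randn(m, z_dim)
        opt_G.zero_grad()
        loss_G(D, G, z).backward()
        opt_G.step()

Note that detach() keeps the generator's weights frozen while the discriminator is updated, matching the alternating scheme described above.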

(Refer Slide Time: 18:01)

We will look at one very popular implementation of GANs called DCGAN, or deep convolutional generative adversarial network. This is one of the first works, cited here, to use a deep convolutional network to generate actual images; the original paper which introduced GANs used MNIST and did not use such deep convolutional networks.

There are some heuristics that the authors figured out, some of which are listed here. They replaced pooling layers in the deep convolutional network with strided convolutions: in the discriminator you have strided convolutions instead of max pooling, and in the generator you have transposed convolutions. Remember that we start with a random noise vector and have to actually generate images, so you need transposed convolutions to upsample.

They use batch normalization in both the generator and the discriminator, they removed most of the fully connected layers, and they use ReLU activation in the generator for all layers except the output, which uses a tanh; the discriminator uses LeakyReLU for all layers. These particular heuristics seemed to work very well for them. I urge you to read the paper, where they were able to generate images that are not part of the training set but still look very realistic.

(Refer Slide Time: 19:33)

We will just look at the architecture quickly. We start with z: you sample z from a distribution, about 100 dimensions, and you project and reshape it to a volume of 1024 feature maps of size 4 x 4. Then you use strided convolutions, in this case called fractionally strided convolutions or transposed convolutions, to increase the size of the feature maps to 8 x 8 while at the same time reducing the number of feature maps; this is a typical pattern seen throughout such networks. In the next layer you have 16 x 16 feature maps, 256 feature maps in total.

Then come 128 feature maps of size 32 x 32, and in the end the output is an RGB image; that is how we interpret the output, basically 3 channels of size 64 x 64. This is the generator network. The discriminator is basically its mirror image: you start off with 64 x 64 x 3 and then go back to 128 feature maps of size 32 x 32, and so on and so forth, going back the same way, and the output is the probability of the image being real or fake, depending on what your input is. A sketch of the generator side follows.
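Here is a minimal sketch of a DCGAN-style generator matching the sizes described above (100-dimensional z up to a 64 x 64 RGB image), assuming PyTorch; the layer choices follow the stated heuristics (transposed convolutions, batch norm, ReLU, tanh output), but details such as kernel sizes are assumptions, not the paper's exact code.

import torch.nn as nn

generator = nn.Sequential(
    # project z (treated as a 100 x 1 x 1 volume) up to 4 x 4 x 1024
    nn.ConvTranspose2d(100, 1024, kernel_size=4, stride=1, padding=0),
    nn.BatchNorm2d(1024), nn.ReLU(inplace=True),
    nn.ConvTranspose2d(1024, 512, 4, stride=2, padding=1),  # -> 8 x 8
    nn.BatchNorm2d(512), nn.ReLU(inplace=True),
    nn.ConvTranspose2d(512, 256, 4, stride=2, padding=1),   # -> 16 x 16
    nn.BatchNorm2d(256), nn.ReLU(inplace=True),
    nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1),   # -> 32 x 32
    nn.BatchNorm2d(128), nn.ReLU(inplace=True),
    nn.ConvTranspose2d(128, 3, 4, stride=2, padding=1),     # -> 64 x 64 RGB
    nn.Tanh(),
)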

There are some interesting things in this paper. Remember that we sample z from a distribution, and for generating new images we keep sampling z; what the authors observed was a continuous transformation of the images as you change z along one axis, so the generator was able to interpolate meaningfully between z values. The paper has some excellent examples, so you can go and look at them.

(Refer Slide Time: 21:45)

The idea again, just to summarize: you have a dataset of training samples, here illustrated with the MNIST dataset, and we have a discriminator which is a deep neural network. We sample noise (in the DCGAN paper, 100 dimensions), the generator takes that as input and outputs MNIST digits in this case, and these are again given as input to the discriminator, which has a loss function based on predicting whether its inputs come from the training data or are the output of the generator.

What is important is this: of course, you would not be surprised if the generator provided as output samples that are the same as the ones in the training data; however, what is observed in most GAN-based implementations of generative models is that, very consistently, the output images are similar to, but not the same as, the ones in the training distribution. These are completely new images which still make sense as images, and the models are also able to interpolate meaningfully between z values; that is a very important point to note.

By continuously changing z you can obtain a sequence of transformations of images which are again meaningful. A lot of work has been done in this area: initially, when the paper came out in 2014, there were problems in generating larger images, so typically the outputs were restricted to sizes like 64 x 64; however, over time, something called BigGAN has come up, which is able to give you very large images at much higher resolution.

However, the memory and computational requirements of course go up. So this concludes our session on GANs; we will also look at some applications in the medical imaging domain, as to what generative adversarial networks are used for there. Just briefly, GANs have a wide variety of applications: for instance, at least in the context of medical images, there are situations wherein there is not much data.

In that case you can do semi-supervised learning using GANs: you can use GANs to generate images like those in your training data set, all the while training the discriminator for a particular task. In this case the discriminator learns to distinguish between the real and fake images, and in the process you hope that it has learned the underlying representation of your data, which you can then fine-tune, with maybe a little bit more data, for a specific segmentation or classification task.

So in the context of semi-supervised learning, especially for medical image analysis, GANs have wide application. GANs can also be used as conditional GANs, in the sense that you can have an additional input, say c, for both the discriminator and the generator, so that the outputs of both are conditioned on that input.

One such application is image translation. Let us say there are some images which are widely available: images of a certain anatomy are widely available in a particular imaging modality, say MR images are widely available and CT images are not. Suppose you have trained a very deep network for MR images; now you have CT images of the liver, but you do not have enough data to train a classifier.

(Refer Slide Time: 26:03)

What you do is have a GAN-like network to translate. GANs have a lot of applications, some of which are summarized in this website; it is a very interesting blog, and I urge you to go look at it. In the context of medical image analysis they have a very crucial role to play, especially in the context of semi-supervised learning: suppose there is a paucity of training data, in fact of labelled training data; then you can use GANs to train a discriminator which learns the underlying structure of the data, and maybe fine-tune it with whatever little data is available.

For training GANs you probably do not need labelled data, just access to a lot of images, say of a particular variety, a particular anatomy, or a particular disease, and then you can train a GAN with it. Since you lack the labels, it is hard to train a deep classifier from scratch; however, once you train a generative adversarial network to generate images of a certain anatomy, you can take the discriminator, fine-tune it, and hopefully it will be a good classifier. That is just one very nice application for GANs.

There are other areas, like image translation, where there is a lot of interest, especially since in the field of medical imaging there are many imaging techniques (MRI, CT, etc.), and for some anatomies and some diseases there are more images available from a particular scan. Let us say more CT images of the liver are available, and there is more labelled training data available for CT of the liver than for MR of the liver; then it could be convenient to set up a GAN network which can translate MR to CT images, so that you can label using the classifier you have created for CT and then translate the result back to the MR images.

Of course, these are research problems, and with the advent of GAN networks these things are made possible. In the recent past there have been developments: something called BigGAN has come up, which is able to generate larger images, because most of the earlier networks following 2014 were only able to generate very small images, in the sense of 64 x 64 or up to 128 x 128, and the resolution was not great either. With more progress in this field, a lot of interesting problems can be tackled, especially in the field of medical image analysis.

Machine Learning for Engineering and Science Applications
Professor Dr. Ganapathy Krishnamurthi
Department of Engineering Design
Indian Institute of Technology, Madras
Variational Autoencoders (VAE)
(Refer Slide Time: 0:19)

Hello and welcome back. In this video we will look at variational autoencoders, which are a class of generative models that provide a principled way to sample from the data distribution, or rather the model distribution. Here is a brief outline of our talk in this video: we will start off with some introduction to autoencoders and what is meant by a latent vector or latent space, then we will move on to variational autoencoders, and finally we will conclude with variational inference, which is the probabilistic interpretation of VAEs. In this video we will primarily focus on how variational autoencoders are used in the deep learning context.

(Refer Slide Time: 0:53)

We will first look at autoencoders; they are a class of neural networks which are trained to reconstruct their input. The most standard setting involves images as input, as autoencoders have a lot of applications in computer vision. Let us say the input data is some image, say from a 28 x 28 dataset; the idea is that the neural network reconstructs the input, so the output will be another image which is also 28 x 28, and optimization proceeds by making sure that your reconstruction and input are consistent with each other, that is, the reconstruction loss is minimized.

The typical structure of an autoencoder: you can think of it as a fully connected neural network or an MLP, where you rasterize the 28 x 28 image into 784 pixels, and you have successive layers with a decreasing number of hidden neurons until you come to a layer which we call the bottleneck layer. From there you have a successively increasing number of hidden neurons in every layer until you get to the output, which is the same size as the input layer.

The layer from which the output is reconstructed is called the representation or the latent space; we will denote it by Z. The idea behind using autoencoders is to obtain a reduced-dimensional representation of the input, in the sense that it retains most of the significant variations in your input, enough to reconstruct it; that is the whole idea behind training autoencoders. However, it is very hard to enforce structure on the Z that we estimate. Typically you would add sparsity constraints and things like that so that you get some meaningful representation; otherwise it is difficult to impose structure on Z. A minimal autoencoder sketch is given below.
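Here is a minimal sketch of the fully connected autoencoder just described (784 to a bottleneck and back to 784), assuming PyTorch; the hidden sizes and the bottleneck width are illustrative choices, not fixed by the lecture.

import torch.nn as nn

bottleneck = 32  # the latent dimension Z

encoder = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),
    nn.Linear(256, bottleneck),          # Z: the latent representation
)
decoder = nn.Sequential(
    nn.Linear(bottleneck, 256), nn.ReLU(),
    nn.Linear(256, 784), nn.Sigmoid(),   # reconstruct the 784 pixels
)
autoencoder = nn.Sequential(encoder, decoder)
# Training minimizes a reconstruction loss, e.g. nn.MSELoss()(autoencoder(x), x)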

What VAEs do is take this a little bit further; in fact, although the names are similar, the connection is rather superficial, because as we will see later, VAEs come from probabilistic arguments which are much more structured and lead to a more meaningful Z. This Z, as I mentioned earlier, is referred to as the latent space or latent vector, and it is what we are trying to infer.

(Refer Slide Time: 3:24)

What does this give us in terms of advantages? First, the latent space or latent vector that we estimate has reduced dimensionality: in the case of images, if you have very large images, it may be possible to get a latent space representation with only 2 or 3 parameters, or maybe hundreds of parameters instead of millions when you are thinking of natural images. And what do we do with this latent representation? The idea is that the representation can actually be used to create new images, or generate new data, by sampling randomly from it. What we ideally want is a probability density function for Z, so that we can sample from it and from there generate images which are close to the training data that we used.

For instance, let us look at digit generation. We have training data X of MNIST digits; these are 28 x 28 images, and you know that there are 60000 of them. Our purpose is to generate digits like X, but not the same ones found in the dataset, which is like saying we want to maximize the likelihood p(x); that is typically what we want in a generative model, a probability model of our input data. So what would the latent structure or latent space represent? In this case, these are the things that we are unable to observe: we only observe the digit once the person has finished writing, so we do not know the different strokes that he or she would have used.

There are also other things like orientation, how big the fonts are, how thick the strokes are, and so on; these are some of the latent factors involved in the output that we see. The observation we make is the digit itself, or rather the image of the digit, but what went into creating it we do not know. So what we are trying to do is model those latent factors: we model them in the form of a latent space from which we sample this vector Z, where Z is a random vector, usually of lower dimension than the input data. What we also want is a distribution for Z: we can have a prior distribution for Z from which we can sample, and we can also define a posterior distribution, which is what most people are interested in, that is, given this dataset, what would be the most likely Z values?

Once we have this distribution, we can sample Z from it and map it to a sample X. That is the idea behind variational autoencoders: to infer this distribution p(z|X), so that we can use it to sample new X's.

(Refer Slide Time: 6:34)

Basically, as I mentioned earlier, the idea is to map training data, in this case input images, to a latent space using a neural network. The latent space is basically the Z, with posterior distribution p(z|X). We make some assumptions about the prior distribution, and we usually model it as a Gaussian. The network which provides p(z|X) does not exactly give you samples from p(z|X); instead, since the distribution is modelled as a Gaussian, it gives you two parameters, basically the mean and the covariance of the distribution.

Now we draw a random sample from the latent space: once we know the mean μ and the covariance matrix, using this information we can sample Z from the latent space and use that to generate data which is similar to the training data X. How is that done? Again, not very surprisingly, we use another neural network, which takes the Z sampled using μ and σ as input and maps it to an output which is very similar to your input data or input images. So if you take MNIST digits, you would map from your sampled latent vector to a digit which looks like it came from the training data distribution.

Of course, you can also interpret the output of this network, the network that takes a sample from p(z|X) and gives you something similar to the training data distribution: if we interpret the output as a sample from a Gaussian distribution, then it leads to a reconstruction loss. We will see how that is done in the next few slides.

(Refer Slide Time: 8:35)

Variational autoencoders consist of three components: the encoder, the decoder, and the regularised loss function; this is from the deep learning point of view. As you saw earlier, the encoder takes training data as input and provides you the parameters of the latent space distribution, and the decoder takes a sample from the latent space distribution and provides you with an output which is similar to the data in the training set. These two, the encoder and the decoder, are both deep neural networks; they can be convolutional neural networks or just fully connected ones, we can use either, though typically for images it makes sense to use convolutional neural networks. And we have a regularised cost function, which we will see, in order to optimise the parameters of these neural networks.

(Refer Slide Time: 9:33)

The VAE architecture: your training data samples are given as input. q(z|X) is represented by a neural network whose weight parameters are ϕ, the weights of the encoder. It gives you an output Z; again, with slight abuse of notation, you do not exactly get Z, but you get parameters that depend on X, so for every X that you give as input, you have a corresponding μ and σ. Then you have the decoder, which is again a neural network, with weights θ, and it takes as input a sample drawn using μ and σ.

You sample using μ and σ and get a Z; this is given as input to the decoder network, and it provides an output X̃, which is a sample similar to X. In this case, we have given a particular X as input, from which we have sampled Z, so we would expect the output X̃ to be the same as the X that we gave as input over here. This is the VAE architecture, and in order to train it we have a cost function consisting of two parts; we will look at what they are, and how they ensure that this neural network is consistent, in terms of a particular Z giving rise to a particular sample from the training data.

(Refer Slide Time: 11:53)

The VAE loss function consists of two terms: this one is your data reconstruction loss, typical in most networks, and this one is your regularizer. The regularizer is nothing but the KL divergence between the distribution output by the encoder network and your prior model for the distribution of Z; in the case of VAEs, both of them are modelled as Gaussian distributions. The KL divergence, if you recall, measures how similar two distributions are; if they are exactly the same, then you get 0, which is the optimal value.

(Refer Slide Time: 13:07)

The sum of these two is what you optimise; let us take a closer look at each of them. As I said earlier, q_ϕ(z|x) is the output of the encoder; we have modelled it as a Gaussian, so you will get a μ and σ corresponding to a Gaussian. p_θ(z) is another Gaussian, with 0 mean and unit standard deviation; its covariance matrix is the identity, which is of course diagonal. So our prior for Z is basically a standard normal distribution, and the outputs of the encoder network, whose parameters are ϕ, are interpreted as the mean and covariance matrix of the posterior distribution q_ϕ(z|x), with the covariance matrix constrained to be diagonal.

Now look at the reconstruction loss: this involves the output of the decoder, which is actually an image, if we are looking to generate images. If we interpret the output as the mean of a Gaussian distribution, then you can once again model the output as the parameters of a Gaussian, the mean being given by the output of the network characterized by θ; if you take the log of that, you will end up with the usual least-squares loss function. On the other hand, if you are looking at, let us say, binary images, wherein the pixel values are 0 or 1, then you can also use the binary cross-entropy.

That is, the binary cross-entropy between your input image, where the pixel values are either 1 or 0, and the output of the decoder, with values ranging from 0 to 1, which is achieved by using a sigmoid at the output. If you interpret each of the output pixel values as drawn from a Gaussian, then taking the log again gives you the usual least-squares loss function; on the other hand, if you treat the pixel values as binary variables themselves, you can use the cross-entropy, which comes from the Bernoulli distribution. Each of these is possible. So we have a combination of two loss functions: one is the reconstruction loss, and the other corresponds to the constraint on the output of the encoder, making sure that its parameters stay close to those of a standard normal distribution; that is the regularisation we impose. This is from a deep learning point of view.

Once again, this regularisation is essential, because you might say: well, since you are constraining Z to be drawn from a standard normal distribution, why do we even need the encoder network? You could always just draw a Z and then try to reconstruct some arbitrary X. The idea behind doing this is that not every Z will give rise to the X's that we have in our training data; we specifically want the vectors that we are using as hidden or latent space representations to correspond to the data samples in our training data. So we have an encoder network that takes a particular X from our training data distribution and maps it to Z, and we try to reconstruct the same X using the decoder from the Z that we have obtained.

Now, the cost function has to be optimized on an image-by-image basis, and in a sense there are two ways of looking at this. One is to just take the loss function literally: once you pass an X through the encoder, it gives you a particular μ and σ; you can then sample a lot of Z values using that μ and σ, pass them through the decoder to reconstruct, and the expectation will of course be over all the Z values that you generated for that particular X. As I mentioned earlier, this is slightly harder to do, because you have to sample lots of Z's, since not every Z that you sample will correspond to the X that you have at hand.

To make this simpler, the expectation in the loss function can instead be taken over all the samples in your training data. The way it works is: given the training data, say in mini-batches or the entire training dataset, you pass it through the encoder to obtain μ and σ, sample one Z from the output corresponding to every X, and pass it through the decoder to get the reconstruction loss, so that we can do back-prop on that cost function. There is still one problem left: the sampling operation per se is not differentiable, you cannot differentiate through the sampling process. There is a workaround for it called the reparameterization trick; we will see what that is.

(Refer Slide Time: 19:42)

So just to recall that briefly: the image on the left is what we looked at. We have an X, a batch of samples from the training data, which goes through the encoder and gives you the parameters of a Gaussian distribution, from which you sample a Z right here; then you pass it through the decoder, which gives you a reconstruction of your input. You then have a sum of two loss functions: the data reconstruction loss that we saw earlier, and the regulariser, which makes sure that the parameters you generate from the encoder remain close to those of a standard normal distribution.

Now, doing it this way, we want to back-prop through the entire network, but that is difficult to do through the sampling step because sampling itself is a non-differentiable operation. The reparameterization trick uses a standard fact: if you have a sample from a standard normal distribution, you can convert it into a sample from a Gaussian with a particular mean and standard deviation, and that is exactly what happens here. Everything else remains the same: X goes to the encoder to give you μ and σ, but for every X you sample an ϵ from the standard normal distribution, then multiply it by σ and add μ to it, and that is given as input to the decoder.
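As a minimal sketch of the trick (assuming, as is common practice, that the encoder outputs μ and log σ²; the function name is illustrative):

import tensorflow as tf

def sample_z(mu, log_var):
    # Reparameterization: z = mu + sigma * eps, with eps ~ N(0, I).
    # The randomness lives entirely in eps, which does not depend on the
    # encoder or decoder parameters, so gradients flow through mu and sigma.
    eps = tf.random.normal(shape=tf.shape(mu))
    return mu + tf.exp(0.5 * log_var) * eps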

Now back-prop can go through the entire network and update all the weights, since the process of sampling ϵ from N(0, I) does not depend on the parameters of either the encoder or the decoder. Once you have trained on the entire dataset, what do we do with it? How do we generate samples? It is actually very simple: all you do is sample from the standard normal distribution. Since we have constrained, or regularised, the encoder that way, you can sample some Z from the standard normal distribution, discard the encoder entirely, and just put that Z through the decoder, and it will give you a sample.
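In code, generation after training reduces to a few lines. The decoder below is a stand-in stub (its layers and the latent dimensionality are assumptions); in practice you would use the trained decoder:

import tensorflow as tf
from tensorflow.keras import layers

latent_dim = 32  # assumed latent dimensionality

# Stand-in decoder; in practice this is the trained decoder network.
decoder = tf.keras.Sequential([
    layers.Dense(7 * 7 * 32, activation="relu", input_shape=(latent_dim,)),
    layers.Reshape((7, 7, 32)),
    layers.Conv2DTranspose(1, 3, strides=4, padding="same", activation="sigmoid"),
])

# Discard the encoder, sample Z from the N(0, I) prior, and decode.
z = tf.random.normal(shape=(16, latent_dim))
generated_images = decoder(z)  # 16 generated images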

Now, how does this relate to the autoencoders we have already looked at? The similarity is the following: both of them have an encoder and a decoder. The classical autoencoder takes your training data samples as input, and through successive hidden layers there is a bottleneck layer which is treated as the hidden representation; that hidden representation is then mapped back to your input, which is exactly what happens here. We have this Q which maps X to Z, that is the bottleneck layer of your autoencoder, and then you have the decoder which takes the hidden representation as input and provides you with the output. So the structure is that of an autoencoder, but otherwise the principles are different.

The advantage of doing it this way, that is, making the output of the encoder stochastic, is that we are not treating Z as some deterministic value but rather as the parameters of the distribution from which Z is to be drawn. This gives some interesting results, because it provides a meaningful, structured Z representation: if you sample close to the Z values estimated for a training data point X, you will get a similar X. And if you systematically sample Z over the range of Z values generated by this network, you can see that Z maps to certain attributes of the images we are using.

It will not be exact, but it gives you a very smooth transition: as you change Z gradually, the images will also change smoothly. The structure of the images can be captured in the distribution of Z; rather than getting some arbitrary hidden space or latent space representation, you have a very structured Z that comes out of this process, and that is made possible by treating the output of the encoder as stochastic rather than as a deterministic hidden layer as in autoencoders.

(Refer Slide Time: 24:54)

Another point: the cost function that we derived from the point of view of deep neural networks, if we go back to what we saw earlier, is the sum of these two terms, which we wrote down as the data reconstruction loss plus a regulariser. That is the deep learning viewpoint; however, it is actually possible to derive this cost function from probabilistic arguments. Starting from trying to maximize the likelihood of the data, one can derive this particular cost function to be optimised, so this is a much more principled way of creating a generative model.
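To sketch the probabilistic argument (standard VAE material, not spelled out in the lecture): the log-likelihood of the data is bounded below by the evidence lower bound (ELBO),

log pθ(x) ≥ E_{qϕ(z|x)}[ log pθ(x|z) ] − D_KL( qϕ(z|x) ‖ pθ(z) ),

and maximizing the right-hand side is exactly minimizing the reconstruction loss plus the KL regulariser, the two terms written above.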

(Refer Slide Time: 25:42)

The innovation here is that the parameters of p(z|x), this PDF, are produced by a neural network: the network regresses the μ and σ of p(z|x). So μ and σ corresponding to p(z|x) are regressed by your neural network. Similarly, the process of mapping from a Z, sampled from p(z), to the data that we actually want to generate is also accomplished using a neural network.

This provides a great advantage, because neural networks can learn very complicated functions on high-dimensional spaces. For instance, if the input to the encoder is a picture, say a 100 x 100 image, that is a dimension of 10,000; those 10,000 input dimensions can be easily handled by a neural network, which provides a structured, non-linear function mapping. Similarly, the mapping from Z to X, the function denoted f here, is also made possible by a neural network; the f represents the neural network, while p and q are the probability distributions.

So we have parameterized the probability distributions as Gaussians and estimated the parameters of the Gaussians using neural networks. That is the strong point of the variational autoencoder, in addition to having a principled way of obtaining the cost function: the Z that you generate from your training data are meaningful. This is not possible with a classical autoencoder, where the best you can hope for is that the bottleneck happens to give a good representation; in this case, as many papers (which we will refer to later) have shown, varying Z smoothly gives you smooth variations of your input, and that is the advantage of using a variational autoencoder.

(Refer Slide Time: 28:04)

So to summarise: the encoder is a neural network that transforms the input image into the parameters of a Gaussian distribution. This Gaussian distribution corresponds to the latent space, so we have a mean and covariance for the latent representation, and we regularise the network by keeping this distribution close to the unit normal distribution. We then randomly sample from the latent distribution, and we assume that this Z generates the input image, that is, the image given as input to the encoder.

Then we send that Z through the decoder, which is another neural network, a function mapping that transforms Z back into the input image. To optimise, we use the reconstruction loss, that is, we make the output of the decoder as close as possible to the input image given to the encoder. This process, over your given input data, trains your decoder appropriately; once the training is done you can discard the encoder if you want, just sample from the standard unit normal distribution, pass it through the decoder, and obtain samples of the data that you want to generate.

Machine Learning for Engineering and Science Applications
Professor Dr. Ganapathy Krishnamurthi
Department of Engineering Design
Indian Institute of Technology, Madras
Applications: Cardiac MRI - Segmentation & Diagnosis
(Refer Slide Time: 0:20)

Hello and welcome back. In this application week we will look at an application of deep neural networks, specifically CNNs, in combination with classical machine learning classification algorithms, to a problem in medical imaging, or medical image analysis. This is based on a paper out of our lab which appeared in Medical Image Analysis, titled “Fully Convolutional Multi-Scale Residual DenseNets for Cardiac Segmentation and Automated Cardiac Diagnosis Using Ensemble of Classifiers”. My Ph.D. students Varghese and Mahendra Khened have co-authored this paper along with me. Mahendra will walk you through the code and some pieces of this algorithm later on; I will just give you an overview right now.

(Refer Slide Time: 1:02)

This is the outline of what we are going to look at. We will give you an introduction and motivation to this particular problem, because you might not be aware of what medical imaging techniques are and what they are used for, et cetera. Then comes the proposed methodology and the segmentation pipeline; the rest, about the diagnosis part of the pipeline, the code walk-through, and some aspects of the data, will be done by Mahendra.

(Refer Slide Time: 1:25)

So let us go to the introduction. You may all be familiar with magnetic resonance imaging; if not, just look it up on the web or in some resources. The physics of how it operates is an entire course by itself, but for the sake of these slides we will just see what it is used for. Magnetic resonance imaging is one of the main tools for assessment of cardiac function, that is, the function of your heart. It is also considered the most reliable method, and it is used in the clinic: when you go to the hospital and take an MRI of your heart, they can use it to figure out if there is something wrong with the function of your heart.

So let us see what MRI does. Like many modern imaging techniques, whether a CAT scan or, in this case, an MRI scan, it gives you sections of your anatomy. If you recall, when you go to the hospital you lie down on a flatbed, the bed moves into the gantry, and you lie there for some time while the images are acquired. The scanner acquires slices of images that run through the cross-section of your body; these are typically referred to as axial slices.

MRI is slightly more flexible: it can acquire images at arbitrary cuts, so you can make a cut like this and look at the cross-section of it. For the purpose of this particular paper we are going to look at the cardiac imaging planes. I also urge you to look up some 3-D models of the heart; lots of anatomical atlases are available online, but for this lecture we will just concentrate on this one. We typically acquire a volume of the heart, so the entire organ is covered by the scan; slices through the heart are acquired, and the notation for how we slice is important. If we look here, LV is the left ventricle. The left ventricle is the chamber in the heart that pumps blood out into your system through the aorta; that is where the oxygenated blood comes out of your heart.

If you cut the LV, its cross-section looks like kind of an oval, if you think about it, and if you cut across its short axis, those image sections are known as short-axis images. Then there are the long-axis planes, which cut along the length of the heart; think of it like cutting this way, whereas the short axis cuts this way. The other plane is the four-chamber plane, which is basically the plane that shows you all the compartments of the heart. The heart has 4 compartments, or 4 chambers: right ventricle, left ventricle, right atrium and left atrium. A view that contains the cross-section of all 4 chambers is called the 4-chamber plane.

Typically these are the 3 orientations in which the volume of the heart is viewed. You can think of the acquired image as a volume, and a volume is a 3-D matrix: a 3-D grid, pixelized of course, because it is a reconstructed image. The acquisition is discrete, the sampling is discrete, so you get a discretized 3-D image, and you can look at each plane of the 3-D volume as belonging to either the short-axis, long-axis or 4-chamber plane. We will primarily be working with the short-axis plane, so we will see what that looks like.

(Refer Slide Time: 4:43)

So here is an animation of the short-axis view. If you look at this, this is the compartment through which the blood actually goes, and the dark region here is called the myocardium. The myocardium is the muscle which actually twists and compresses the heart, you can think of it that way, and pumps the blood out of the heart into the system. Now, the beating heart is not a stationary organ, but since magnetic resonance imaging helps you acquire images very quickly, you can actually image the heartbeat: you sample different heartbeats at specific time intervals and put together this image.

One entire heartbeat, compression and then release, one stroke of the heart in which it pumps the blood out, is called one cardiac cycle, and the images are acquired throughout the cardiac cycle. So you have volumes: if a typical heartbeat is one second, you can end up acquiring 20 or 25 time points in that one second. At each time point the heart will be in a different state of compression or expansion within the cardiac cycle; you can think of it that way. The plane on the left that you are looking at is the axial plane, and you can see the cross-section; this is a short-axis cross-section.

The 4-chamber view is right there; the left ventricle and the long axis are shown right there too. The 4-chamber view shows the section of the heart with all 4 compartments, through which the blood travels before it gets pumped out into the system. And this is the long-axis view, where you can see along the length of the left ventricle. Many diagnoses are done with cardiac MRI, but the clinically most important factors for physicians are obtained by looking at the left ventricle, because that is the one that pumps blood out into your system, and any weakness or any defect or pathology there is found by studying the left ventricle during the cardiac cycle.

Another pair of terms you should be aware of is systole and diastole. Think of diastole as when the left ventricle is filled with blood, when the myocardium is relaxed a little bit and the ventricle fills with blood; systole is when it is fully compressed, when it is actually pumping blood out into the system. The images are acquired between end-systole and end-diastole; that is the terminology typically used. Just to reiterate: if your heart rate is 60 beats per minute, you have one beat per second, which corresponds to one pumping action, and typically 20 to 30 image volumes of the entire heart are acquired within that one cycle.

If you run it like a movie, that is what you get: if you take a particular cross-section of the heart, take the same cross-section from the image at every time point, and put them together, you can make a movie like this. That is how this is done.

(Refer Slide Time: 8:15)

There are some other terms that give a better view of the heart. Here you can see the base of the heart, which, if you think of the heart standing up, is basically its top; the apex is the tip right there, and the directions are referred to as base-to-apex or apex-to-base. Continuing from the previous slide, we have this region in green, which is the myocardium, the muscle that does the pumping. If there is something wrong with that muscle, that is when you get heart problems like heart attack, cardiac arrest, things like that.

The left ventricle is the cavity which contains the blood, and right next to it, the adjacent chamber, is the right ventricle, again highlighted in green.

(Refer Slide Time: 9:09)

If we go back to the previous slide and look at this particular slice, we can see the LV very clearly; this is just the left ventricle alone. And in this slice right there, this is the left ventricle, as I point out, and right next to it is the right ventricle, the region I just marked to show you. Of course, I am not an expert in anatomy; I just learned some of it in the context of this problem, so if possible consult a radiologist or somebody who has done medicine, who can probably point this out better than I can. In the context of the problem, these are the anatomies that we will be concerned with.

The dark region that you see, let me use a different colour, maybe green: this dark region here which I am going to shade is where the myocardium is, the muscle which pumps the blood out into your system.

(Refer Slide Time: 10:06)

That is being highlighted here: the left ventricle is the cavity that holds the blood, the surrounding muscle is the myocardium, and the adjacent chamber is the right ventricle. Before we go any further, what is the challenge? What are we going to do? A lot of times, when the patient is imaged, the radiologist can just go through these images and sometimes see the defects. But often some quantitative analysis has to be done; among the several factors that are calculated is the ejection fraction, and you may also want the area or the size of the myocardium. For all of those tasks, where quantitative markers have to be obtained from the image, what you need is accurate segmentation.

This is what we called semantic segmentation when we looked at convolutional neural networks: we have to label the pixels corresponding to the myocardium, the pixels corresponding to the left ventricle, and the pixels corresponding to the right ventricle, and these things have to be accomplished automatically. Why is it so important to do it automatically; why cannot the radiologist just go and do it? The problem is that there are many volumes in time, and I will talk about that; but consider even just the volumes acquired at end-systole and end-diastole, where end-diastole is when the left ventricle is filled with blood and end-systole is when the blood has been pumped out.

If you look at it, each volume will have hundreds of slices, and if you are also going to go through several time points, it means you will have to look at about a thousand slices, a thousand pictures if you want, in which the radiologist has to go and manually mark the boundaries. That is going to be prone to error and it is definitely going to be time-consuming; the radiologist probably could not see more than a patient a day in that case. Hence the case for automatic techniques for doing the segmentation; that is one thing. The 2nd important thing is: what do we do with all these quantitative measurements?

The idea is to predict disease. A patient walks in, maybe referred by some cardiac specialist who says the heart should be scanned by MRI; the scan is done and the image comes out. Then it would be nice to have a system that provides a diagnosis on its own. There are several conditions, and we will see what they are; in this case we are considering 4 to 5 conditions which can actually be inferred from the images themselves. Can we do it automatically? That is the question. For instance, this is like doing a whole-body scan of the patient and saying “you have cancer”, some diagnosis like that, without manual intervention; that is the idea.

So the task is two-fold: one is the automatic semantic segmentation as we know it, and the 2nd is the prediction of the disease in the heart based on the quantitative markers or values that we obtain from the segmentation; we will see what those values are later in the talk.

(Refer Slide Time: 13:27)

So what is the pipeline for this work? Generally, we have about 3 blocks. The 1st one right here is the preprocessing; some aspects of data normalization are also contained in it, and we will see what this preprocessing technique exactly is. The preprocessing helps us zoom in on the region of the image that is of importance to us.

(Refer Slide Time: 14:05)

Why would that be important? If we look at the picture here, it is a pretty big image, about 512 x 512; it can vary from 256 x 256 to 512 x 512. But only this small region here is the region of interest for us: this is where the heart and the cardiac muscles are, and we need only this region. To make the CNN's job easier, we do not want a very large input to the CNN; if we can crop this region out of every image successfully and automatically, that will be useful, and that is what this particular piece does.

The 2nd piece is data normalization, because it is part of the training process of the neural network. Then we have the neural network portion here, our deep CNN we can call it, and its output is the segmentation; we do some postprocessing of the segmentation to make sure it is clean. The third piece here is the classification of disease. That is the system; we will go through each one of them, I will give you a brief overview, and then we will see the individual steps in much more detail.

(Refer Slide Time: 15:18)

So, region of interest extraction, step 1: how is this accomplished? I mentioned that if your heart rate is, say, 60 beats per minute, then one beat completes in about a second, and we acquire about 20-30 volumes in that one beat, each being an entire volume. That is useful, because we can consider the same plane, the same height, the same cross-section, across those time points; for instance, we acquire M = 30 time points, which corresponds to 30 volumes.

We select a plane; the scanner has some references, so we are able to figure out in absolute terms where a particular Z value, a particular coordinate, lies, and we can select the same plane at every time point. The different stages of compression or expansion of the heart are referred to as cardiac phases. If you arrange that plane across the time series, you get a stack of images, and we can take a standard deviation across that stack: take one pixel, look at that pixel through the stack, and calculate the standard deviation.

That standard deviation picture is what you see here. What is obvious is that the edges have a high standard deviation, not surprisingly, because the heart is moving, so we expect a much larger standard deviation near the edges where the myocardium is, and that helps us localize the heart. We can threshold the standard deviation image, and the regions of large standard deviation will correspond to the moving parts of the image. Of course, the heart is not the only thing moving; there will be some slight distortion of the liver and so on because of the heartbeat, but that is not too significant, so the highest standard deviation corresponds to the moving anatomy, which is the heart. That is the image we get: the standard deviation image taken across the different time points of the cardiac cycle, the cardiac phases.
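As a minimal NumPy sketch of this step (the array shape and the threshold fraction are assumptions, not the paper's exact values):

import numpy as np

def std_dev_map(cine_stack):
    # cine_stack: shape (T, H, W), the same slice across T cardiac phases.
    # Pixels on the moving myocardium vary strongly over the cycle, so the
    # temporal standard deviation highlights the heart.
    return np.std(cine_stack.astype(np.float32), axis=0)

# Thresholding keeps only strongly moving regions (the 0.3 fraction is arbitrary):
# sd = std_dev_map(stack)
# mask = sd > 0.3 * sd.max()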

(Refer Slide Time: 17:35)

Once we have that, we can run an edge detector on the standard deviation image; we use the standard Canny edge detector. If you do not know what a Canny edge detector is, you can read up on it; Matlab has a Canny edge detector built in, so you can look for it in the help section and try it on some of the standard images provided. Once we do this edge detection, you see that the circular objects are quite well defined and the edges are clean. On top of this, we apply what is called the Hough transform; again, please look this up, we are not going to go through all of these techniques since this is not an image processing course.

The Hough transform helps detect circles; it is the circular Hough transform that we are using to detect circles in images. Typically, if you have a black-and-white image like the one we have generated here, you can use it to automatically detect circles. If you are wondering why that is important, just think about how you would actually detect a circle in an image; it is a nontrivial problem. So we use the Hough transform to detect circles, and once we have the detected circle we can localize the centre of the left ventricle through some statistical techniques and then place a box around it; that is the crop. We have fixed a standard box size and we take that region out.
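A hedged sketch of this localization using scikit-image (the radius range, box size and function name are assumptions; the paper's actual implementation may differ):

import numpy as np
from skimage.feature import canny
from skimage.transform import hough_circle, hough_circle_peaks

def localize_lv(sd_map, radii=np.arange(15, 45, 2), box=64):
    # Edge map of the temporal standard-deviation image.
    edges = canny(sd_map / sd_map.max())
    # Circular Hough transform; keep the single strongest circle.
    accum = hough_circle(edges, radii)
    _, cx, cy, _ = hough_circle_peaks(accum, radii, total_num_peaks=1)
    cx, cy = int(cx[0]), int(cy[0])
    # Fixed-size box centred on the detected circle (clipped at the borders).
    x0, y0 = max(cx - box // 2, 0), max(cy - box // 2, 0)
    return sd_map[y0:y0 + box, x0:x0 + box], (cx, cy)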

We do this for all the images that we use for training, testing and validation. On the other hand, you could say that the CNN is supposed to figure all this out by itself; that is also possible. If you do not want to do this crop preprocessing, which involves localizing the left ventricle, you can feed the entire image directly to the CNN as input. It just becomes a more difficult problem to solve, because so much confounding anatomy is present. In general, either approach works. The point of showing you this is that just because these deep learning tools are available, you do not have to use them to solve every part of the problem; many conventional techniques out there are well studied, well documented and nicely programmed, and you can use them to make the problem much simpler.

What we have done by this cropping is to make the problem very simple, in the sense that the CNN now does not have to localize the heart: we are giving it only the beating heart as input, so the edges and other features we want it to detect are the ones it can focus on. Again, if you do not want to do any of this, that is also fine; you can take the entire image and give that as input to the CNN. Also recall that there used to be memory issues, but now there are more powerful GPU cards with more memory and RAM available, so in fact you can fit very large images and feed them as input to the CNN. Of course, training them is also much harder, and you have to figure out the correct hyperparameters for the same.

(Refer Slide Time: 20:45)

So what are we going to use? We are going to be using a CNN, no surprise there, because it is a powerful tool for image processing. Let me just walk you through what we have already seen. This is a typical CNN, where you have an input image, an RGB input image in this case; you can also have a stack of images as input, that is also possible. We have looked at 3D CNNs too, I think I mentioned this when I talked about brain tumor segmentation, but in this case we are going to work plane by plane. So this is the input image, then we have a bunch of feature maps extracted by filters, then a couple of fully connected layers, and then the corresponding output categories.

So this is the typical CNN that you have all seen, just to refresh your memory. Remember also that the problem we want to solve is the one corresponding to semantic segmentation: we have to figure out the class every pixel belongs to, namely the green, which is the myocardium or the muscle around the left ventricle, the left ventricular cavity (LV cavity), and the right ventricle. These are the things we want to label, as they carry the most diagnostic information. How do we do that? We saw that by shifting to a fully convolutional network structure we can do semantic segmentation in one stroke: the entire image goes through in one pass and all the pixels are predicted.

That is accomplished as follows: you can think of this as the encoding layer and this as the decoding layer. The decoding layer uses bilinear interpolation or transposed convolutions to increase the size of your feature maps so that you are able to predict all the labels in one shot.

(Refer Slide Time: 22:40)

We also use the inception architecture, wherein we look at receptive fields of different sizes in the same layer. In most networks we have seen, the receptive field size is fixed in every layer; in the inception architecture, every layer has a combination of different filter sizes based on the problem, from 1x1 to 5x5 in the naive implementation. Of course, in order to make the computation more feasible, there is also the version where we do a 1x1 convolution to reduce the number of feature maps corresponding to the different filter kernel sizes. There are 2 examples here: one with 1x1, 3x3 and 5x5 filters, and one with slightly larger filter kernels.

Both of them are the naive implementation, except that what is missing in this one is the 1x1 convolution, which projects the feature maps into lower dimensions before you do the 3x3.

(Refer Slide Time: 23:49)

The other two concepts we are going to use are the residual network and the DenseNet. For the residual network, we saw that it improves training, gives faster convergence, and lets you go deeper, by adding feature maps from a previous layer into a successive layer: the input X to a layer is taken and added to the output of the next layer. These skip connections, or shortcuts, are what we call shortcut connections.

In ResNet the combination is an addition: we add the feature map from 2 layers below to the output, which helps with faster convergence, better learning, and deeper networks. Another version of this is the DenseNet. Here they define a dense block, which has several convolutional layers, and one of the hyperparameters of the dense block is called the growth rate K; here K = 3. Every convolutional layer inside the dense block outputs exactly K feature maps, and the output is concatenated with the input, which is what is shown here: you have L features as input, they go through a dense block layer, and the output is concatenated to the input, giving K + L features.

A typical convolutional layer inside a dense block, and this is actually a blow-up of one such layer, has a batch normalization layer and an activation function followed by a 3x3 convolution; prior to the convolution we have these other operations, which are also defined as layers, and together that is just one convolutional layer. Inside a dense block we will have multiple such convolutional layers, each of them outputting K feature maps, and you concatenate the input of that particular layer with its output, which then goes as input to the next layer. That is the idea behind the dense block that we saw earlier.
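Here is a minimal Keras-style sketch of a dense block built from these rules (the number of layers, the activation and the default growth rate are illustrative, not the paper's exact configuration):

import tensorflow as tf
from tensorflow.keras import layers

def dense_block(x, num_layers=4, growth_rate=12):
    # Each layer is the composite BN -> activation -> 3x3 conv described above;
    # it emits growth_rate (K) feature maps and is concatenated with its input.
    for _ in range(num_layers):
        y = layers.BatchNormalization()(x)
        y = layers.Activation("relu")(y)
        y = layers.Conv2D(growth_rate, 3, padding="same")(y)
        x = layers.Concatenate()([x, y])  # channels grow from L to L + K
    return x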

(Refer Slide Time: 26:20)

The architecture we are going to look at is similar to an architecture reported in the literature, in the paper titled “The One Hundred Layers Tiramisu”; it has about 100 convolutional layers. It is kind of like the UNET, but instead of plain convolutions you have dense blocks. If you look at it, there is an input on which you do a 3x3 convolution, the output of that goes to the dense block, and the output of the dense block is concatenated (C is the concatenation) with the input.

Then there is a transition down layer, the one that is blown up here, which has max pooling and a 1x1 convolution, to reduce the size of the feature maps, et cetera. After a couple of transition layers you have a bottleneck layer, again a 1x1 convolution, and then you do the transition up layers, the 3x3 transposed convolutions. Similar to the UNET structure, when you go all the way back up, when you are upsampling on the decoding side, you also have these connections which concatenate the encoding-side feature maps, because they have the resolution intact; when you are upsampling it is good to have that resolution back.

So that is the architecture reported in the literature; we have made several changes to it. One very obvious change is the projection here: instead of directly concatenating, we project using a 1x1 convolution and then add it, from the encoding side to the decoding side. The idea is that it is a UNET-kind of architecture, except that instead of plain convolutions it has dense blocks, and the connections going from the encoding side to the decoding side use a projection followed by an addition rather than a direct concatenation.

(Refer Slide Time: 28:46)

We also have an inception module at the beginning, which considers different receptive field sizes and different resolutions. The 2nd aspect of the neural network is the loss function. We use the usual cross-entropy loss function over the 3 classes, but it is weighted, and the weights are obtained from the ground truth: for every image there is a corresponding ground truth, and the rarer classes are weighted more than the classes that occur more frequently. For instance, the red regions here belong to the weight map; what you are looking at here is the weight map.

The red region here has a higher weight than the other regions, and the blue regions are weighted much less. Of course, we normalize this so that it does not blow up; it is shown on this scale just for display. The red region is weighted more because there is a lot of overlap with the right ventricle here, which makes it difficult to segment, so those edges are weighted more than the other edges. Based on pixel counts, edge locations, et cetera, each pixel is weighted differently, and that weight map goes into the loss function.

The next aspect is that, since we are predicting all the pixel labels in one shot, another way of regularising the output is to regularise the cost function using what is called the Dice coefficient. The Dice coefficient measures the overlap between the predicted mask and the ground truth mask: it is twice the intersection (the true positives) divided by the cardinality of the ground truth mask plus the cardinality of the prediction. You can think of it as the fraction of pixels that you predicted correctly. But if you want to use it inside a loss function, you have to put it in a differentiable form, which is what is given here: p is the output score from your CNN, that is, the probability it predicts, and g is the corresponding ground truth. The index i runs over the pixels, so every pixel has its probabilistic output and its ground truth; we take the summation of their products divided by the normalizing factor, and that is the differentiable loss function. The total loss function has a hyperparameter λ1 multiplying the cross-entropy loss plus another hyperparameter λ2 multiplying (1 - Ldice); we want to minimize the loss function, which is done by maximizing Ldice. If you think about it, an Ldice of 1 corresponds to a perfect match between your prediction and the ground truth: if you overlap them, you do not see any difference between them, and that corresponds to one.

Minimizing (1 - Ldice) therefore drives the segmentation towards the ground truth, because higher values of Ldice correspond to better and better segmentations; minimizing (1 - Ldice) is the same as maximizing Ldice. And then we have L2 regularisation, where W are the weights.
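A minimal TensorFlow sketch of the soft Dice term and the combined loss (the λ values, the ε smoothing constant and the function names are illustrative; the paper's exact weighting scheme is its own):

import tensorflow as tf

def soft_dice(p, g, eps=1e-7):
    # p: predicted probabilities from the CNN, g: one-hot ground truth.
    # Returns a value in (0, 1]; 1 means a perfect match.
    inter = tf.reduce_sum(p * g)
    return (2.0 * inter + eps) / (tf.reduce_sum(p) + tf.reduce_sum(g) + eps)

def total_loss(weighted_ce, p, g, lam1=1.0, lam2=1.0):
    # lam1 * weighted cross-entropy + lam2 * (1 - Dice).
    # The L2 weight penalty is typically added separately, e.g. through
    # kernel regularizers on the layers.
    return lam1 * weighted_ce + lam2 * (1.0 - soft_dice(p, g))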

(Refer Slide Time: 32:15)

Now we will look at the architecture in much more detail, and also at how it is trained, the interpretation of the results, and so on; Mahendra will explain the rest of the architecture from here. What you see in the slide is the network architecture in slightly more detail. The input is basically a 128x128 image, the crop of the region corresponding to the cardiac anatomy, which is given as input. Then we have the inception module here, which outputs 128x128x36: a 3x3 convolution outputs 12, a 5x5 convolution outputs 12, and a 7x7 convolution outputs 12 feature maps. These are given as input to the dense block with K = 12, which adds 24 feature maps to the output, so you then have 128x128x60.

The transition down layer reduces it to 64x64x60, and we continue alternating between dense blocks and transition down blocks all the way to the bottom, where we get 16x16 maps, 144 of them. We then have a bottleneck layer which outputs about 60 maps; this is taken up by the transition up layers, which go all the way back to the output segmentation map of size 128x128x1. In between we have all these shortcuts, as we call them, which take the feature maps of the corresponding level. Remember, as we come down this side, the encoding side, the resolution goes down, and on the other side we are trying to upsample.

To improve the upsampling, we take the information from the encoding side and pass it to the decoding side, and that is done using a projection layer: we can project the 128x128x60 into a small number of feature maps and add it to the feature maps that come out of the transition blocks; that is what happens at every level. The optimizer used is Adam with a learning rate of 10^-3, and “He normal” weight initialization is used. The mini-batch size is 16, the network is trained for about 200 epochs, and it has about 400,000 parameters.

We also augmented the data with random flips (flipping the image left-right), rotations, translations, and some elastic deformations. The deformations are kept very small, because cardiac anatomy cannot be deformed arbitrarily; we had to manually look at the range within which the deformations remain reasonable. This is programmed in TensorFlow, which has its own inbuilt routines for deformation with parameters you can tune so that the deformations do not become unphysical; we are looking at real anatomy, so the deformations have to stay within the confines of the problem.

For intensity normalization we use slice-wise min-max normalization: every slice is normalized so that its range is confined to 0 to 1.
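In NumPy this is a one-liner per slice (the ε guard against constant slices is an added assumption):

import numpy as np

def minmax_slice(sl, eps=1e-8):
    # Slice-wise min-max intensity normalization to the [0, 1] range.
    sl = sl.astype(np.float32)
    return (sl - sl.min()) / (sl.max() - sl.min() + eps)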

(Refer Slide Time: 35:47)

After prediction we do postprocessing. What postprocessing does is largest connected component analysis; in Matlab these are automated, so please look up connected component analysis, which comes under the topic of morphological image processing. We also do hole filling, again morphological binary hole filling.

What do we mean by that? If you look at it, there is a nice prediction here, but this piece is not correct. In connected component analysis we identify the large connected components, apply a threshold (which can be somewhat arbitrary and depends on the problem), and remove the small components for every class. In every class we do connected component analysis and remove the pieces that are not meaningful; we expect the left ventricle to be in one piece, so we take the largest piece possible, think of it like that, and this is the result after connected component analysis. We also do hole filling: for instance, this region is again a wrong classification, so we fill it in with the correct classification, which is accomplished using morphological hole filling. Matlab has all of these built in, so you can look that up as well.

Python also has some of these morphological image processing tools available.
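For instance, with SciPy's ndimage module the two operations might look like this for a single binary class mask (a minimal sketch; the paper's thresholding details may differ):

import numpy as np
from scipy import ndimage

def postprocess(mask):
    # Keep the largest connected component of the binary mask, then fill holes.
    labeled, n = ndimage.label(mask)
    if n == 0:
        return mask
    sizes = ndimage.sum(mask, labeled, range(1, n + 1))
    largest = labeled == (np.argmax(sizes) + 1)
    return ndimage.binary_fill_holes(largest)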

Machine Learning for Engineering and Science Applications
Professor Dr. Ganapathy Krishnamurthi
Department of Engineering Design
Indian Institute of Technology Madras
Applications: Cardiac MRI Analysis - Tensorflow code walkthrough

Hello, my name is Mahendra. I am a course TA and a Ph.D. student working with Dr Ganapathy Krishnamurthi. I will be continuing from where Ganapathy sir left off, and I will be describing the dataset.

(Refer Slide Time 0:30)

For our work, we used 3 publicly available datasets: the ACDC challenge dataset, the STACOM dataset and the Kaggle Data Science Bowl dataset. In the ACDC dataset we were provided with 100 cine MRI cases, and for the task of segmentation we were given ground truth annotations for the left ventricle, right ventricle and myocardium. These annotations were provided at end systole and end diastole, and the cases involved 5 groups of patients, namely normal, dilated cardiomyopathy, hypertrophic cardiomyopathy, myocardial infarction and abnormal right ventricle.

For validating our algorithm, they provided us with 50 cine MRI cases. Similarly, the STACOM dataset also provided us with 100 cine MRI cases, with 2 groups of patient categories involved, and the testing set was also 100 cases here. The Kaggle Data Science Bowl is an annual challenge hosted on various medical problems; in 2016 they hosted a challenge for predicting the volumes of the ventricles at systole and diastole.

Their training set involved 700 cine MRI cases; no annotations were provided, but instead only the reference volumes at systole and diastole.

(Refer Slide Time 1:54)

Now I will present a visual description of our model's results. You can see here: this is our input image, and this is the ground truth; ground truth means the segmentation contours drawn by the radiologist. And this is the prediction from our model. Shown is a particular normal case at the end-diastole phase, with slices taken from the apex to the base. You can see that our model's prediction is very close to the ground truth.

(Refer Slide Time 2:37)

I am showing the results for the systole case as well. One of the challenges in cardiac segmentation is segmenting the topmost and bottommost slices of the heart: the basal slices, which are close to the arteries, and the apical slices, which are close to the tip of the heart. These are very difficult regions to segment, and you can see some mispredictions here. Our model does very well in predicting the mid-slice region of the heart.

(Refer Slide Time 3:13)

You can see that here in the prediction and the ground truth. When it comes to the apex slices, because the region is so small, it is sometimes quite ambiguous to clearly demarcate the apical sections.

(Refer Slide Time 3:24)

On the ACDC challenge test set, our model gave a competitive result and we stood 2nd in the challenge. The metrics used were both clinical and geometric: the clinical metrics included the ejection fraction and the volumes at ED and ES, and the geometric metrics were the Dice score and the Hausdorff distance. Using these metrics, they evaluated our segmentation performance. For the left ventricle we had a rank of 2, for the right ventricle a rank of 3, and for the myocardial segmentation a rank of 2. Our overall ranking was 2nd in this challenge, with the results compared against the other participants of the challenge.

(Refer Slide Time 4:21)

Even in the STACOM challenge we achieved a fairly competitive result; our method was fully automatic and on par with the other techniques. The metric used was the Jaccard index. You can clearly see that our model, too, had slightly lower scores for the apex and base slices, whereas in the mid-slices the predictions were really good.

(Refer Slide Time 4:52)

In the case of the Kaggle dataset, we did not use any training data from Kaggle; instead, we took our model trained on the ACDC dataset and tested it directly on the Kaggle test set. They used the continuous ranked probability score (CRPS), defined in the formula below, which scores the predicted distribution against the actual volume. Our score was around 0.0127, which would have given us 10th position in the actual challenge.
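For reference, the CRPS as used in that Kaggle challenge is commonly written as follows (quoted from memory of the challenge's evaluation rules, so treat the exact constants as an assumption): with N test cases, P_m the predicted cumulative distribution over volumes for case m, and V_m the actual volume in mL,

CRPS = (1 / 600N) Σ_{m=1}^{N} Σ_{n=0}^{599} ( P_m(V ≤ n) − 1{n ≥ V_m} )²,

so a well-calibrated predicted distribution that concentrates around the true volume gives a score close to 0.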

(Refer Slide Time 5:29)

I will now discuss the next part of our work, the automated diagnosis. In the ACDC challenge the task involved 2 steps: segmentation as well as automated cardiac diagnosis. Once the segmentations were obtained from the cardiac MRI, we used them to predict the disease of the heart. As I previously mentioned, there were 5 patient categories in the dataset, described as shown in the figure. The normal heart is given here and compared with dilated cardiomyopathy.

In DCM, the walls are thin and the ventricles are dilated. In the abnormal right ventricle, the right ventricular region is enlarged. In hypertrophic cardiomyopathy, the walls of the myocardium are thickened, and in the case of infarction, certain segments of the myocardium become thin compared to the rest of the myocardium. These were the 5 patient categories given in the ACDC dataset, and the objective was to develop an automated algorithm that does the classification.

(Refer Slide Time 6:49)

This is our diagnosis pipeline. Any machine learning algorithm has certain steps, and the very first step here was extracting the features from the segmentation. We categorised the features into 2 groups, primary and derived. The primary features are things like the volumes of the ventricles, the mass of the myocardium, and the wall thickness of the myocardium at each slice.

The derived features are derived from the primary features, like the ejection fractions of the left and right ventricles, the ratios of the ventricular volumes, and the standard deviation of the myocardial wall thickness measures within a slice and across slices. The pipeline works as follows: from the cardiac MRI we use our segmentation model to get the segmentations, and then a feature extraction module to extract features.

The features extracted are described in the next slide. We split the data into a training set and a validation set; this split was done for model selection, mainly to pick the best classifier for training our algorithm to classify into the 5 classes, and for tuning the hyperparameters. The preprocessing involved feature selection and feature scaling, then training the classifiers, tuning the hyperparameters, and doing the model selection. We used five-fold cross-validation for model selection, feature selection and hyperparameter tuning.

Based on the five-fold cross-validation, we found that an ensemble system does very well in predicting the disease group, so we developed a two-stage model. The first stage is an aggregation of 4 classifiers: each classifier was trained independently on all 5 patient categories using all the features, and we found that these 4 classifiers individually gave the best results. So we grouped them, took each classifier's output, and did a majority vote to get the final prediction from the first stage of the diagnosis.

After the first stage of diagnosis, we pass to the 2nd stage. We observed that certain patient categories, like infarction and dilated cardiomyopathy, are closely related groups, and it was quite difficult to discern the disease category between them. So we had to develop an expert classifier. We found that an MLP does the best job for this task; it was trained on a subset of the features used for training the 4 first-stage classifiers, and we found that the myocardial wall thickness profile features at end systole were sufficient to properly separate these 2 classes.

So the 2-stage implementation is like this: the first stage gives a 5-class prediction, and the 2nd stage is a refinement that corrects some of the misclassifications. The final classification is one of the 5 categories shown here.
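A hedged scikit-learn sketch of this two-stage scheme; the three stage-1 classifiers here are stand-ins for the paper's four, and the synthetic data, labels and wall-thickness column indices are purely illustrative:

import numpy as np
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

# Synthetic stand-ins: X (cases x features), y (5 disease labels),
# wall_idx (columns of the end-systole wall-thickness features).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = rng.choice(["NOR", "DCM", "HCM", "MINF", "ARV"], size=100)
wall_idx = np.arange(12, 20)

# Stage 1: majority (hard) vote over independently trained classifiers.
stage1 = VotingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=100)),
                ("svm", SVC()),
                ("lr", LogisticRegression(max_iter=1000))],
    voting="hard").fit(X, y)

# Stage 2: expert MLP trained only on the two confusable classes,
# using just the wall-thickness features.
confusable = np.isin(y, ["DCM", "MINF"])
expert = MLPClassifier(max_iter=2000).fit(X[confusable][:, wall_idx], y[confusable])

# Refine stage-1 predictions for cases predicted as DCM or MINF.
pred = stage1.predict(X)
refine = np.isin(pred, ["DCM", "MINF"])
pred[refine] = expert.predict(X[refine][:, wall_idx])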

(Refer Slide Time 10:19)

This slide lists some of the features used for our training: as I mentioned, the volume features, the ejection fractions, the volume ratios, and the myocardial wall thickness variation profile, which was captured using various statistics such as the maximum, the mean and the standard deviation of the variation across the short axis and the long axis.

(Refer Slide Time 10:45)

We also used a random forest to select features; a random forest classifier can be used for feature selection, and this is one use case of it in our work. As I said, we had a two-stage diagnosis, and in the 2nd stage we had to figure out the relevant features for separating the patient cases between infarction and DCM. The random forest showed us that these features at the end-systole phase are the important ones, sufficient to segregate the 2 classes.
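With scikit-learn, ranking features by random forest importance takes only a few lines (the synthetic data and feature names below are stand-ins for the actual DCM/MINF cases and wall-thickness features):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 8))                       # features of DCM/MINF cases
y = rng.choice(["DCM", "MINF"], size=60)           # their labels
names = [f"wall_thickness_{i}" for i in range(8)]  # feature names

rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)
for i in np.argsort(rf.feature_importances_)[::-1][:5]:
    print(f"{names[i]}: {rf.feature_importances_[i]:.3f}")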

(Refer Slide Time 11:24)

In the challenge, our method achieved an accuracy of 100%, and this is attributed to our two-stage classification. I can show you clearly how this method was effective: in the first stage, our classifier gave an overall accuracy of 92%, with misclassifications between DCM and MINF; in the confusion matrix as well, the misclassified cases were only among DCM and MINF. Upon applying the 2nd stage, all these misclassifications were corrected and we were able to achieve 100% accuracy.

(Refer Slide Time 12:06)

The paper concludes as follows. We developed a very parameter- and memory-efficient 2-D multi-scale FCN based on residual DenseNets, and we showed the novelty of the weighting scheme used for combining the cross-entropy and Dice losses. We achieved state-of-the-art performance on two challenging cardiac segmentation tasks. We also handcrafted novel features for cardiac disease diagnosis, and achieved state-of-the-art performance on automated cardiac disease diagnosis.

(Refer Slide Time: 12:39)

I would like to add one more point here: our loss function, compared to other standard losses, gave a better visual appearance. You can clearly see it here; these are the standard losses like cross-entropy, weighted cross-entropy and Dice loss, versus the weighting scheme used by our method. Compared to the ground truth, our model had a better visual appearance than the other loss functions, and hence gave better performance compared to the other losses.

(Refer Slide Time 13:15)

We have uploaded our source code online so that people can benefit from it, and I urge you to look into this code, because many examples of TensorFlow and scikit-learn usage appear in it. The link will be posted in the announcements section, so please look into it; I will briefly give a walk-through of the code so that you will find it easier to run on your system. Most research papers these days come with code, uploaded to GitHub. GitHub is a common platform where projects can be developed in collaboration and shared with the world.

(Refer Slide Time 14:02)

Our paper's code is put up on the GitHub profile. The instructions for using the code have also been put up there, and the methods to train, test and validate are given as well.

(Refer Slide Time 14:17)

I will just give you an overview of how this code is organised. Basically, any deep CNN work involves preparing the data, loading the data, preparing the model (the neural network architecture), and developing estimators for analysis during training, like the loss function and the metrics for finding the accuracy of the model. These are the modules in our code. To start off, we have to preprocess the data.

(Refer Slide Time 15:01)

This module, data preprocess, has all the code for preprocessing the raw data. This cardiac data comes in the NIfTI format; these are very specific medical image formats which might not be efficient to process directly for a neural network. So we need to actually extract the data, extract some of the information, and preprocess it; all of this is given in this source file. It is all self-explanatory, and I urge you to walk through it.

I will just give an intro to what this module does. Basically, the medical images are in the NIfTI format, and they need to be converted into the numerical Python format, NumPy. In NumPy we can do various mathematical operations for easy and fast processing.
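For example, a NIfTI volume can be read into a NumPy array with the nibabel library; the file name below is a hypothetical one in the style of the challenge data.

    import nibabel as nib

    img = nib.load('patient001_4d.nii.gz')   # hypothetical file name
    vol = img.get_fdata()                    # 4-D NumPy array: x, y, slice, time
    print(vol.shape, vol.dtype)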

(Refer Slide Time 16:00)

So basically the cardiac dataset, as I explained, is a 4-D volume or 4-D stack: at each phase of the cardiac cycle there is a 3-D volume, and each phase is a time point, so it is an aggregation of a time series of volume data. We need to extract the relevant slices where annotations are present, and these slices need to be converted into a suitable format; this code does that. Basically, we also need preprocessing to extract the ROI centre.

(Refer Slide Time 16:40)

The code for ROI centre prediction is in this file, the extract-ROI FFT-based methods. The code is self-explanatory; you can look into it, including the reading and writing routines. One more thing: these files have to be converted into proper formats, like the HDF5 (.h5) file format, for faster processing.
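As a small illustration, with assumed dataset names rather than the repo's actual keys, preprocessed slices can be written to an HDF5 file with h5py like this:

    import h5py
    import numpy as np

    images = np.random.rand(32, 128, 128)             # dummy preprocessed slices
    masks = np.random.randint(0, 4, (32, 128, 128))   # dummy segmentation masks

    with h5py.File('train_data.h5', 'w') as f:        # hypothetical file name
        f.create_dataset('images', data=images, compression='gzip')
        f.create_dataset('masks', data=masks, compression='gzip')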

(Refer Slide Time 17:04)

The second step here is the preparation of the data loading. As I previously discussed, most neural networks are trained on batches of data, and to enhance the dataset, augmentations are performed. This module does two tasks: when you need to feed the network with data, it takes the relevant files from the preprocessing step, and then preprocessing such as augmentation and deformations is done in this file, along with the normalisation.

(Refer Slide Time 17:49)

The loader basically extracts the relevant images from the preprocessed files and prepares batches; these batches are passed through a preprocessing module where augmentations such as deformations and rotations are done, and the normalisation is also done, before they are fed to the neural network. A minimal sketch of such a loader follows.
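This sketch, assumed for illustration and not the repo's loader, uses a simple 90-degree-rotation augmentation and per-image normalisation:

    import numpy as np

    def batch_generator(images, masks, batch_size=8):
        # images, masks: NumPy arrays of shape (N, H, W)
        n = len(images)
        while True:
            idx = np.random.choice(n, batch_size, replace=False)
            x, y = images[idx].copy(), masks[idx].copy()
            for i in range(batch_size):
                k = np.random.randint(4)                           # random multiple of 90 degrees
                x[i], y[i] = np.rot90(x[i], k), np.rot90(y[i], k)  # augment image and mask together
                x[i] = (x[i] - x[i].mean()) / (x[i].std() + 1e-7)  # normalise
            yield x, y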

(Refer Slide Time 18:06)

And in this model module, we define most of the TensorFlow architecture. We have used TensorFlow to build our network, so this is the object where we define our neural network architecture. We develop everything using the TensorFlow layers module, and the lower-level primitive functions are developed in this file called network.ops, where we develop the basic building blocks for our neural network. You can look into it for further understanding.
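As a flavour of what such building blocks look like, here is a minimal sketch, assumed rather than taken from the network.ops file, of a dense block in the TensorFlow 1.x layers API, where each layer's output is concatenated with its input:

    import tensorflow as tf

    def dense_block(x, num_layers=4, growth_rate=16):
        for i in range(num_layers):
            y = tf.layers.conv2d(x, filters=growth_rate, kernel_size=3,
                                 padding='same', activation=tf.nn.relu,
                                 name='dense_conv_%d' % i)
            x = tf.concat([x, y], axis=-1)   # feature reuse via concatenation
        return x

    inp = tf.placeholder(tf.float32, [None, 128, 128, 1])
    out = dense_block(inp)                   # channels grow to 1 + 4 * 16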

So how the dense blocks are developed and how the layers are defined: all these things you can look into in these two files. Once the models are developed, we need to have a proper estimator.

(Refer Slide Time 19:06)

The estimator is something like a bridge between the data loading and the model, and it also estimates the statistics while training is happening. So here, around the estimator, there is a config file, the train.py file and the estimator file. The config file is the file where we configure some of the hyperparameters: what the batch size is, what the number of classes is, the number of epochs to be trained, what the learning rate is, where the data is located. All these things are configured in the config file, for example as sketched below.
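For illustration, such a config might look like the following; the names and values here are hypothetical, not the repo's actual settings.

    config = {
        'batch_size': 8,
        'num_classes': 4,                     # background plus three cardiac structures
        'num_epochs': 150,
        'learning_rate': 1e-3,
        'data_dir': './processed_acdc_data',  # hypothetical path to the preprocessed data
    }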

After the configuration is done, we have the train.py file, where we configure the model architecture and the model hyperparameters like the growth rate, the number of layers, the rate decay and dropout; all these things are configured in train.py. And this is the main file which we need to run to initiate the training procedure.

(Refer Slide Time 20:07)

This is the estimator file, where we estimate, while training, what the loss is and how the training is progressing; the TensorBoard functions are also implemented in this file. TensorBoard is another tool which you can use to visualise how the training is progressing: you can see live updates of the loss, as well as how the live predictions are doing on the training and validation sets.
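A minimal sketch, in the TensorFlow 1.x style of the code base and with illustrative names only, of logging a scalar loss so that TensorBoard can plot it live:

    import tensorflow as tf

    loss = tf.placeholder(tf.float32, name='loss_value')   # fed with the current loss each step
    tf.summary.scalar('training_loss', loss)
    merged = tf.summary.merge_all()

    writer = tf.summary.FileWriter('./logs')               # then run: tensorboard --logdir ./logs
    with tf.Session() as sess:
        for step in range(100):
            summary = sess.run(merged, feed_dict={loss: 1.0 / (step + 1)})  # dummy decreasing loss
            writer.add_summary(summary, step)
    writer.close()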

(Refer Slide Time 20:34)

Once the segmentation network is trained, we can use our testing module to test our segmentation results. We can also upload our test results on the website of the challenge, where they evaluate our model. The second part is the cardiac diagnosis; let us walk through this code.

(Refer Slide Time 21:00)

Once the segmentations are done, we can run the generate cardiac features file; this file has functions to read the segmentation masks. From the segmentation masks, the various features described in the previous slides are extracted, and these are saved in a CSV file. This CSV file is used for training our classifier. In the stage 1 and stage 2 diagnosis, we have classifier selection using cross-validation studies. Basically, we have used scikit-learn functions throughout to implement our classifiers here; a minimal sketch follows.
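Here is a small sketch of this kind of classifier selection with five-fold cross-validation; the file name and the candidate classifiers are assumptions for illustration, not the authors' exact pipeline.

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.svm import SVC
    from sklearn.model_selection import cross_val_score

    df = pd.read_csv('cardiac_features.csv')        # hypothetical CSV of handcrafted features
    X, y = df.drop(columns=['label']).values, df['label'].values

    for clf in [RandomForestClassifier(n_estimators=100), SVC(kernel='rbf')]:
        scores = cross_val_score(clf, X, y, cv=5)   # five-fold cross-validation
        print(type(clf).__name__, scores.mean())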

(Refer Slide Time 21:46)

So what we do is read the CSV file, do a five-fold cross-validation study, and do the classifier selection and the training of the model on the training set. Once the training is done, we can use the trained model on the testing set. The stage 2 diagnosis does a similar operation, but here the problem is limited to the diagnosis, or the correction, between the DCM and MINF cases. Once this file is run, we get the final prediction results in the prediction folder, where we get the prediction for each cine MRI volume as either normal or any one of the four pathologies of the heart. You can check this model by running it on your system, and you can also get a feel for it by tuning the parameters or playing around with the classifiers. You can upload to the challenge website and get feedback on whatever results you have got. Thank you.

Machine Learning For Engineering and Science Applications
Professor Dr. Balaji Srinivasan
Department of Mechanical Engineering
Indian Institute of Technology, Madras
Introduction to Week 12
(Refer Slide Time: 00:16)

Welcome to week 12 of our course. This is the final week, and as we had promised earlier, we will be looking at applications of machine learning. Dr Ganapati has already shown you a few applications of machine learning within medical image analysis, so this week will be primarily dedicated to applications in engineering; of course, this is applicable to science applications too, and I will show you a quick application towards Schrodinger's equation also. Now, before we go forward, I want to emphasize that we will primarily be looking at deep learning applications. There are of course applications of all the other algorithms that we discussed, such as SVM, PCA, KNN etcetera.

And they have been used, actually extensively used, in engineering for quite some time, over the last ten to fifteen years, in various fields; SVMs etcetera have been used quite extensively. We will not be covering those; we will just be looking at modern applications, except for just one. I will be showing you a few very modern examples, modern meaning the last two to three years of success that machine learning, and specifically deep learning, has had within some engineering applications. What I will do is just give you a few hints of how these problems can apply to engineering etcetera.

Now we look at some specific types of applications, so let me talk about these types. The first is what are called surrogate models. Surrogate models are typically faster models, used where physics-based models are slow or in some cases impractical. Let me give you an example: suppose you want to make a weather prediction tool. Actually, weather prediction involves a lot of very complex physics, but if you have done a few weather predictions with the complex physics, you can make up a sort of faster model, an approximate model. A surrogate model is an approximate model built using prior computations, or prior data, or experiments, or combinations of these.

So using this you can make an approximate model. I will show you a couple of examples of surrogate models; these are in some sense probably the most popular uses of machine learning. Now the surrogate model, in turn, can be used for inverse problems. What is meant by an inverse problem?

(Refer Slide Time: 03:55)

An inverse problem goes from effect to cause; normally you go from the cause of something to its effect, but an inverse problem goes the other way. I am just listing these in terms here; in case you are already working on some of these cases somewhere within engineering or science, this would be a little more helpful to you, and hopefully you will also see a little more use or sense in what I am saying as we go through the examples.

We can also use surrogate models for optimization. For example, let us say you have a shape optimization problem: you want to find out what shape of an airfoil or a wing will give you the maximum lift, or the minimum drag, or the maximum of what is called the L by D ratio, etcetera. What people do is change the shape a little bit and then find out the drag etcetera; drag means the force applied on the airfoil as it moves through air. Typically this is very expensive: each time you change the shape a little bit, if you want to find out what the drag is, you have to do a long computation, and that is one other place where you can use surrogate models.

So surrogate models would be the strongest case for neural networks etcetera within deep learning. Now the other thing is direct modelling of unknown physics. Modelling of unknown physics means, like in the weather example I gave you: let us say you have a really complex problem, and you know the inputs and you know the outputs, but you do not know the connection; the map itself is actually unknown. In such cases you would use neural networks and deep learning.

So let me slightly distinguish between surrogate models and, let us say, modelling directly. With surrogate models, you might know the model; it is just that it is very expensive to actually use it, so in such cases you use what are called surrogate models. In either case you will use neural networks. Now, there are a few other cases where you can use neural networks.

(Refer Slide Time: 06:58)

One of these is directly solving the differential equation; I will show an example of this. You can have a neural network actually solve a differential equation by itself. Now, this is just a hint of what neural networks can do; none of this is very standard yet within engineering and science, but it gives a flavour of what neural networks can do. You can also have control problems, where typically reinforcement learning is useful. For example, Prof. Andrew Ng, several years ago, did the control of a sort of model helicopter using deep learning ideas.

Now, these are some of the uses you can have within engineering and science, and I will show you a few examples. A quick note of advice on how to approach this particular week: it is a bit different from all the previous weeks. The theory portion is going to be less; I will just show a problem and its solution, and in fact most of the solutions I am going to show are by other people. Now, in order for you to maximize your learning, the best approach is this: I will describe the problem first in a separate video, and once I describe the problem, I would like you to think about what kind of structure would be ideal for it, what kind of neural architecture: would it be an ANN, would it be a CNN, etcetera.

What would be ideal for this problem? How would you pose the input, how would you pose the output, how would you make the network structure? So please think about the problem before you look at the solution as offered by other people; some of these solutions are actually quite elegant. This will exercise your mental muscle, so that you are prepared when you actually have a practical problem you wish to solve. Throughout this course we got several questions of the form "how would I solve problem XYZ in some field"; most of those fields neither of us actually knows, but if you get into the habit of seeing how these problems are actually solved by other people, you will see where you can use your domain expertise.

So the key point is that these problems require some amount of domain expertise plus some amount of knowledge of machine learning; it is these two together. Domain expertise means that if I am solving fluid mechanics problems, I should know something about fluid mechanics; typically it is useful in how you are going to pose the problem to the machine learning algorithm. So as we go through these problems this week, please think about how you would have solved them if you did not know the solution already, and then come to the next video, where the solution will be discussed. Thank you.

Machine Learning For Engineering and Science Applications
Professor Dr. Balaji Srinivasan
Department of Mechanical Engineering
Indian Institute of Technology, Madras
Application 1 Description Fin Heat Transfer
(Refer Slide Time: 00:13)

Welcome back. We are going to start with a very simple problem, almost a trivial problem, and the solution, as you will see later on, is quite simple. This is a problem that is discussed quite often in simple undergraduate heat transfer courses. Even if you do not know heat transfer, the problem is something that you can understand quite simply from a physics perspective. So let us take what is known as a fin; a fin simply means an extended surface. Why is it used? Most often you would have seen this on a generator, or even at the bottom of your motorbike, and in several other places; you will notice it very commonly, and it is even used in electronic devices.

Now, the reason for this is to enhance heat transfer, in order for something to cool down a little more easily. So let us say this is the base and it is at some temperature; I am going to call the temperature θ, so θb is the θ of the base. Now, what happens is, let us say this portion, the base, is really hot, say it is the car radiator or an engine and it has become very hot, and you have air or some other cooling fluid going past it and you want to cool it. How would you do so?

Now, all of us intuitively understand, whether you know heat transfer or fluid mechanics or not, that there are a few ideas with which you can increase heat transfer. One is: if this is hot, I make sure that I blow air past it. You know, if you get something hot, you kind of blow on it; it is built into the human system intuitively that if you blow air past something, it will become cooler, and more so if you blow faster and faster. This is the reason why fans are there; this is called convective cooling.

So if you blow colder air past it, or the surrounding air is colder, then you will increase heat transfer, or you will make it cool faster. The other idea is to increase the surface area. At home, if you are trying to cool milk or water, you will put it in a wide-based vessel; exposing a lot of surface area actually causes cooling. In fact, it is speculated that some early dinosaurs, the stegosaurus I think, had large fin-like structures in order to help them cool down. So this kind of thing is what is called a fin: all it does is increase the amount of surface area exposed to the air, or whatever the surrounding fluid is.

Now, trying to design this structure nicely is a very important problem within heat transfer, especially when your devices are small. In such cases we know that a few parameters determine what the temperature is at various places. So let us call this distance x, and say I have kept a temperature measuring device, typically what is called a thermocouple (hopefully you would have studied this in school), at a particular place, and I measure the temperature θ there. Technically speaking, this θ is the temperature here minus the temperature on the outside, but you do not need to remember this exactly.

So you keep a thermocouple and you measure a temperature. Our question is a very simple one to start with: how does θ depend on x? And we can intuitively see (this is where inherent physical knowledge also comes in) that the farther away you are from the base, the lower the temperature will be; this is our first intuition. The second intuition is: the higher θb is, the higher the temperature that you measure here is going to be.

Now, this is an example of what I would call simple physics modelling: you have some idea of what the input is and what the output is, but you do not quite know the physics. In this case, of course, we do know the physics, which is how we are going to generate the data; but in many cases, like in weather, you know what things affect what output, but you do not actually know the physics. For the fin, the question is: can neural networks help in such cases? Now let me pose the question a little better: will θ depend only on this x? Obviously not. We know that it will depend on the geometry also in some way, and it will also depend on what kind of fluid you have outside: is it air, is it water, etcetera.

(Refer Slide Time: 05:50)

So all these parameters are abstracted into one single parameter M. Now, what is this magical M? I am not going to discuss it in detail; if you have done heat transfer you already know it. But let me just mention what it includes. It includes h, which decides whether the fluid is water or something else, how fast the fluid is going, etcetera; all of that sits in h, which is called the convective coefficient. Another thing is k; k depends on the material of the solid itself. You know, obviously, that if it were wood the temperature would drop in a certain way, and if it were, let us say, steel, the temperature would drop in another way.

So this depends on the solid. Other things are the cross-sectional area (how much area you have across the cross-section of this fin), how long the fin is, etcetera. So all these parameters have somehow been combined into M. Now, how have they been combined into M? Through dimensional analysis. So once again, our knowledge of the physics of the problem goes into deciding that these factors are important, and the next step, dimensional analysis, also depends on physics knowledge, in order to combine them and make the data a little more compact.

So finally, the problem in physics is as follows: I want to find out how θ (remember, the temperature at a particular location) depends on the base temperature θb, on this parameter M, which includes all the physics of the problem, and on the location x. So we want to find this function θ = f(θb, M, x), or let us say approximate this function. Now, to approximate this function, what do you need? You need data. What kind of data do you need?

(Refer Slide Time: 08:29)

You need data of various θb's, various M's, various x's, all put together along with what the measured θ was. So let us say a lot of people, over lots of years, in a particular lab or all over the world, somehow measured θb, M, x and θ, but nobody had a clue what the function was. This is, of course, an artificial example, but you can easily extend it to real-life cases, like the weather case I mentioned, or even questions like "what is the temperature at this point in the room, provided I keep my air conditioner somewhere": you do not have good analytical expressions for some cases, so what we can make do with is data. So for what we are going to do next, in the next video, I would like you to think about how you would do this. Actually, let me show you the data.

(Refer Slide Time: 09:23)

Here is one example of the dataset that I will be using in order to build our model. The dataset is as follows: there are, let us say, a thousand data points (I have exactly a thousand points here). Say a hundred experiments were done, and each of them took ten data sets; each measured various θb's, M's and x's and found the actual output temperature θ measured by the thermocouple at that particular location x.

So once they have done that, this historical data is now available to you, and you wish to figure out: how can I make θ a function of θb, M and x? I would like you to think about what structure you would use, etcetera; the input and output are already kind of obvious in this problem, but what structure would you use, how would you measure this, etcetera. Please think about that before seeing the next video. Thank you.

Machine Learning For Engineering and Science Applications
Professor Dr. Balaji Srinivasan
Department of Mechanical Engineering
Indian Institute of Technology, Madras
Application 1 Solution
(Refer Slide Time: 00:13)

Welcome back. When I set up the problem, I described the simple problem of fin heat transfer: this is typically a simple extension coming from a base, and we wanted to find out, given some historical data of the dependence of theta on theta b, m and x, how we generally fit a function. This is sort of a trivial task; it is close to the tasks that we took up in the initial weeks of this course.

(Refer Slide Time: 00:47)

So I showed you the actual data also. This had a thousand data points, you had a variation of theta b, M and X, and the output given was theta. Now, of course, the simplest structure applies when you have such heterogeneous data; by heterogeneous data I mean that it is not all theta b, it is not all M, and it is not all X. You actually have three different physical quantities: one is a temperature, another is a physical parameter, and the third is a location.

(Refer Slide Time: 01:24)

And you want to combine these three and get theta out. The obvious structure to use here is a very simple artificial neural network. Given the simplicity of the problem, this is probably the best thing to try; engineering practice over the last several years shows that for some such simple problem, you should simply try one hidden layer. We have not done any ANN example so far, so I am going to show you this as a very simple example: just use one hidden layer and see how well it does. Only if it does not do well should you add more complex structure, especially when you are dealing with ANNs. So if I have this as a hidden layer (we will also do something like change the number of neurons in the middle, so let us say a few neurons are here), this is the input layer, the hidden layer, and this of course is the output. We will keep some such structure and try to see how well it fits these thousand data points.

So we are going to try that. Now, I am going to use MATLAB for this case. I know that throughout the course people have been asking for Python; it is actually fairly straightforward to do this with Python, but the MATLAB implementation, especially for a simple artificial neural network, is very powerful. I would strongly recommend it, especially if you are going into engineering practice; it works extremely well. Of course, it is not hard to program this within Python either, within either TensorFlow or Keras or whatever it is that you are using currently. There is another reason for me to use MATLAB.

It has a very nice graphical interface; its GUI is very nice. Also, I found it surprising, during the running of the course itself, that it can do several things that are actually quite hard to do with TensorFlow. We are not going to see examples of that today, but it has some functionality that TensorFlow does not have, because TensorFlow is not bothered about a single hidden layer; they have built it specifically for deep networks, not for shallow networks of this sort.

So the lesson I would like to point out here is: in case you have access to MATLAB (its biggest problem, of course, is that it is not open source and it is not free), to the extent that you have access to it during the duration of this course, please try it. You will find it surprisingly good at many tasks; even for deep learning tasks I have found surprisingly good performance with MATLAB. It is actually quite impressive.
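Since many of you have asked for Python, here is a minimal sketch, my own illustration and not the MATLAB demo itself, of the same one-hidden-layer fit in Keras, assuming the Excel file has no header row and the columns are theta b, M, x and theta:

    import pandas as pd
    from tensorflow import keras

    data = pd.read_excel('finhistoricaldata.xlsx', header=None).to_numpy()
    X, y = data[:, :3], data[:, 3]             # first three columns in, theta out

    X = (X - X.mean(axis=0)) / X.std(axis=0)   # normalise inputs for gradient descent

    model = keras.Sequential([
        keras.layers.Dense(10, activation='tanh', input_shape=(3,)),  # one hidden layer, 10 neurons
        keras.layers.Dense(1)                                         # linear output for regression
    ])
    model.compile(optimizer='adam', loss='mse')                       # least-squares loss
    model.fit(X, y, epochs=500, validation_split=0.15, verbose=0)

As we will see in a moment, a first-order method like Adam will typically take longer to converge on this problem than Levenberg-Marquardt does.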

(Refer Slide Time: 04:34)

So here is what we are going to do. I have called my data file finhistoricaldata.xlsx. Let us run our code; very simply, I have called my script create network, and I have read in the data using xlsread. Basically, this reads our input data, the Excel file that I showed you. Now that the data is read, we would like to label the inputs and outputs separately: the first three columns, remember, were the inputs, and the next column was the output.

So this is not a very complex code; I need not have shown it at all. Now, this command here, nnstart, actually starts the neural network GUI. You can do all of this without the GUI, but for demonstration, especially in a video, I find it particularly convenient; this is one other reason why we are showing this within MATLAB rather than Python. So suppose we start; you will see this. This is actually an old toolbox; MATLAB now has a deep learning toolbox that does far more things, much like TensorFlow or anything else. Now we are going to use the fitting app within this, so please see this.

(Refer Slide Time: 06:06)

So the general structure that is used within the fitting app is, of course, one input layer, a hidden layer, and directly the output layer, which is sitting there.

(Refer Slide Time: 06:19)

Now it asks for inputs. In fact, if somebody is interested, you should probably make an interface of this sort for TensorFlow or something similar, just so that beginners can easily get into a problem. So I am going to give X as the input and Y as the output.

(Refer Slide Time: 06:41)

Now please see here: within MATLAB they have a default set of values for splitting the data into training, testing and validation sets. All of you would have seen this during the earlier weeks: whenever you have data, you use the training data in order to get your parameters, the validation data in order to set some of your hyperparameters, and the testing data in order to finally report how good or bad your results were.

(Refer Slide Time: 07:11)

So now it is asking for the number of hidden neurons. What you might see on the screen is that the current number of hidden neurons in this very first layer is set to 10. We will leave it as it is, 10 hidden neurons; so currently we have 3 inputs, 10 hidden neurons, and the output layer is just one output. We will change it a little later, just to see what the effect is.

(Refer Slide Time: 07:41)

Now here is one other place where, for a single hidden layer, MATLAB wins over: it asks for a training algorithm. Remember, all we have been using so far is gradient descent, one or the other version of it. If you check the options here, there is something called Bayesian regularization, which we have not done; nor have we done scaled conjugate gradient, nor Levenberg-Marquardt. The Levenberg-Marquardt algorithm works very well.

It works very well if you have a single layer and your loss function is a least-squares loss function, which is basically what we want to use for our problem: I have all these predicted values, and what would the loss function be? It would be a least-squares loss function, because this is a regression problem and not a classification problem. If you try gradient descent on the current problem, you will find that you will of course have to normalize the input, which we have not done; you might note that the columns we had as input and output were not normalized.

When we did the linear regression problems, you might remember that when we did not normalize, we had lots of trouble with gradient descent. Whether we use gradient descent, stochastic gradient descent, Adam, etcetera, they will typically be much slower on this problem than Levenberg-Marquardt. So I will just start the training here, and you will see that within a few seconds it has actually fully trained this simple network.

If you write code within Python, my suspicion is that gradient descent will actually work a little slower than what MATLAB did, even for this. Now, notice a few things that have been reported; remember the statistical measure R, which shows you the amount of correlation.

(Refer Slide Time: 09:32)

Let me just show the correlation here; this is the normalized correlation. You will see that the output is matching really well: if it is a straight line, it basically means that your outputs are matching very well, that is, your predicted output versus the actual output match very closely.

(Refer Slide Time: 09:52)

If you plot the error histogram, you will see that the errors are of the scale of 0.009 or so, and you will in fact find that they look Gaussian. This is a very nice example of the central limit theorem that we have been talking about: even the errors in the predictions of the neural network are arranged very nicely, as if they were drawn from a Gaussian. If time permits, I will show you, probably later this week, how to solve an inverse problem using this idea that the error is distributed as a Gaussian; you can combine a neural network with Bayesian ideas, MLE, MAP. Dr. Ganapati discussed these in previous weeks very nicely. If time permits, I will do that.

(Refer Slide Time: 10:39)

So this is just an example: the answer was not exact, but you do have a very nice error plot, and the error is quite low. Now, whenever you have trained, you look for a few things; remember our overfitting and regularization discussions. You will see that the training error is around three into ten to the power minus six (this is the mean square error), and you will also see that the validation and the testing errors are round about the same, of the same order of magnitude.

When that is the case, we can kind of assume that we do not have too much of a variance problem, and we do not have overfitting etcetera, so we are reasonably okay. We will also see that the correlation is extremely high, which means 10 neurons for this problem is actually pretty good; and 10 neurons is nothing, as you know from the CNN cases. So with three inputs and one output, you can actually fit very well using just 10 neurons. Now suppose we want to change the number of neurons and we bring it down to two. Remember that the mean square error in the previous case was of the order we just saw; if I go here and train again, you will see that the mean square error has actually gone up, even though the correlation is still pretty good. What you notice now is that the errors have gone up in magnitude; if we plot the error histogram, you will see that its scale has shifted by a couple of decimal places. So on reducing the number of neurons, even though you have only a few weights right now, it still looks a little bit Gaussian, but you will see that the corresponding errors have actually gone up.

So if we come here (let me go back), and instead of two, suppose I use twenty and train this network, you will see that the error is extremely low now; it has gone down even further. Luckily, the training and the testing errors are still approximately of the same size, as is the validation error, so you know that we have not yet overfit the problem. So this is a simple example of a neural network. If you actually want to use this toolbox, let me show you an example here.

(Refer Slide Time: 13:13)

You can save this network as net, as you will notice here. So this is a very simple example of what can be done with neural networks: we had a dataset with a thousand points, and we pretended that we did not know the physics, but we had a lot of historical data; we saw that we could easily fit it using a neural network. Now, this has several uses, as I said: it can be used as a surrogate, it can be used for inverse problems, etcetera. In the next video we will see a slightly more complex problem. Thank you.

Machine Learning for Engineering and Science Applications
Professor Dr. Balaji Srinivasan
Department of Mechanical Engineering
Indian Institute of Technology, Madras
Application 2 description Computational Fluid Dynamics
(Refer Slide Time: 0:14)

Welcome back. In this video we will be looking at the second application. The previous application was a very straightforward one: we had three inputs and simply one single output; the inputs were heterogeneous, and we had a simple output, the temperature at a particular point in a certain body, which we wanted to predict. There we saw that artificial neural networks actually perform really well.

We are now going to look at a slightly more complex example; this is one of those applications which has come up only in the last couple of years. As usual, as I said earlier, I will simply be talking about the application in this video, and I will let you think about how you could go about doing this; in the next video we will look at how people have actually solved it in recent years.

So the problem here is that of a surrogate model for what is called computational fluid dynamics (CFD); computational fluid dynamics happens to be the area that I work in most extensively. As the name suggests, it is fluid dynamics done computationally.

Fluid dynamics, simply, is the prediction of fluid flow. Those of you who are not in fluid dynamics can still appreciate the kind of example that I am giving: in case you are doing any problem at all in engineering or science, you will typically find some partial differential equations, or in some cases ordinary differential equations. These tend to appear whenever you are dealing with continuum fields; continuum fields means we assume we are dealing with continuous quantities. Even though we know atoms exist, we typically model a fluid as if it is a continuous thing, and a solid as if it is a continuous thing, and all this belongs to the general field of continuum mechanics.

Now, the ideas that I am discussing can be applied to electromagnetics (Maxwell's equations), or fluid dynamics (what are called the Navier–Stokes equations), or solid mechanics, etc; you can apply them to practically anything, including, let us say, relativistic equations. Wherever you have a governing PDE or ODE, the general idea that I am going to discuss will actually be pretty useful.

Okay, so the problem is as follows: you have the governing partial differential equations, which are called the Navier–Stokes equations. Let me show those equations to you, just so that you get a flavour of what they look like.

(Refer Slide Time: 2:58)

So these are the Navier–Stokes equations. The ones that I have shown here represent what is known as the momentum equation, and they represent the fact that Newton's second law is applicable to a fluid also; that is where you get these terms that you see here. As you can imagine, when I want to describe a fluid flow, let us say around an aircraft, or around a car, or inside an air conditioner, wherever it is, there are three properties that we know are satisfied: we know that mass is conserved, we know that momentum obeys Newton's second law, and we know that energy is conserved.

So these three put together describe fluid flow. Now, even though they describe fluid flow, you can see that the equations as they occur here are fairly complex.

Now, very quickly: if you have a two-dimensional flow, let us say you are just looking at the flow in a plane channel, it will have three quantities that we would like to describe: the velocity component in the x direction at a particular point, the velocity component in the y direction at a particular point, and the pressure at a particular point.

Of course, you would have density also, but in case the flow is what we call incompressible, which is what I am going to assume for this video and the next, the density is a constant, okay. So you have two equations here for two of these variables, and you also have a third equation, the mass equation, which is written in this way. Once again, none of you have to know this for the exam or for you to successfully do the (())(4:59); this is just to appreciate what we are going to do in this video and the next.

Okay, so as you see this, you will see that the equation is actually fairly complex; in fact, this term is somewhat nonlinear, and for fluid mechanics, with Navier–Stokes, we know that there are generally no analytical solutions. So you cannot solve them like the analytical solutions that you would have found in class, etc; people in CFD know this very well, and in fact this is a very big problem.

Now, how do we solve this in general? If we cannot have analytical solutions, how do we actually solve the fluid flow equations? In fact, the planes that you travel in, let us say a Boeing, etc, are all designed on the basis of solutions to these very equations. So it is actually an advanced art and science right now to solve these equations and be able to predict what happens in a flow outside a body, etc; this is really how we design things.

(Refer Slide Time: 6:04)

Okay, so suppose you want to predict this flow. What you do is find an approximate numerical solution; you cannot find an exact solution, so you find an approximate numerical one. The way you do it is, let us say, to take each derivative, for example the derivative ∂u/∂x here, and approximate it as

∂u/∂x ≈ (u(x + δx) − u(x)) / δx.

Of course, I am giving you a very simplified and simplistic example of how we do this.

So the way we do it is: instead of having this continuous field, which is what you would get in case you had an analytical solution, you say: I want my solution at some finite points; tell me the solution at, let us say, nine of these points, I just want these nine points. So what you say is: well, I will approximate the derivative here as the value here minus the value here, divided by δx, or the value here minus the value here divided by 2δx, where δx is this distance.

Similarly, whenever you have a y derivative, you can approximate it using δy, and so on and so forth. Okay, so this is called the finite difference method; this was one of the first methods that was tried, about a hundred years ago. Now we have come to much more sophisticated schemes, but the basic idea still remains the same: this set of PDEs becomes a set of linear or algebraic equations, okay. A minimal sketch of this forward-difference idea follows.
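Here is a small sketch of that forward-difference approximation applied to a field with a known derivative, just to make the idea concrete:

    import numpy as np

    dx = 0.01
    x = np.arange(0.0, 1.0, dx)
    u = np.sin(2 * np.pi * x)                # a field whose derivative we know exactly

    dudx = (u[1:] - u[:-1]) / dx             # forward difference at each grid point
    exact = 2 * np.pi * np.cos(2 * np.pi * x[:-1])
    print(np.max(np.abs(dudx - exact)))      # the error shrinks as dx shrinks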

So each PDE is converted to a bunch of linear equations, and then we know how to solve linear equations; we know how to solve combinations of linear equations, or even combinations of algebraic equations, and you get the solution. So this is how you get what is known as a flow field; please underline the word field: field means you actually have a distribution of the same quantity everywhere, okay.

So this is our aim: to start from the equations, together with the specific circumstances that I am in. The circumstances are usually given by boundary conditions: obviously, even though they are the same equations, the flow past an aircraft is going to look different from the flow past a cycle, which is going to look different from the flow past a man, okay. And what decides it? What the boundary of the object is like; so these are called boundary conditions, okay.

So PDEs plus boundary conditions, passed through the finite difference method, give us the solution. This is the current situation.

(Refer Slide Time: 9:38)

Now, the problem with this is that for complex bodies it can take a very long time to actually solve these equations. For example, let us say I am doing the design of an aircraft, or the design of the shape of a racing car: each time I make a small change in the shape of the body, I will have to go through a lot of computational time in order to get one solution.

In fact, in India and at other places, we know that this kind of computation, called CAE (computer-aided engineering), where you find lots of approximate solutions to the problems facing you, whether they are fluid mechanics or solid mechanics, costs a lot of money. Big companies like Ford, Chrysler, etc spend a lot of money just doing this, because in each design cycle you have to go through many computations.

So here is the question that I want to raise: can CFD learn? What is meant by "can CFD learn"? What I mean is: suppose I have done hundreds of computations in the past. For the hundred and first computation, should I be solving the same equations again, right from scratch, or can I somehow use my data from previous simulations and find a simpler model that will run faster?

In other words, can I find a surrogate model? Can I find a surrogate model for CFD? Now, why would this be useful? Because of the example that I gave you: when I want to do computations for some design, let us say a car, instead of running the full CFD model, suppose I could run a simple surrogate model which runs much faster; then I could get my solutions much faster, okay.

(Refer Slide Time: 12:04)

So before I leave you with this question, let me show you a quick example of what kind of solutions we are expecting from this surrogate model. What you see here is what is called flow past a cylinder. What is flow past a cylinder? As you can see, you have some fluid coming from here at a certain speed, let us say capital U, and there is a block; it is sort of like when you have kept your hand outside the window of a moving car, and you can see that the flow goes past it (obviously it is not a good idea to put your hand out of a moving car, but I just wanted to give this as an example).

So let us say you have a flow past a cylinder, and you want to predict the pressure, vx, vy, etc: that is, the flow field. What you see here are contours of pressure, as well as velocity vectors; you can see these velocity vectors here. So let us say you have done a lot of flows: you do flow past a cylinder, which gives you flow 1; then you do flow past a square, all this using CFD; and you keep on doing simple shapes, let us say a triangle. And then we ask a certain question: if I give you a very complex shape, let us say a car, what flow will it give? Or what velocity field will it give?

To repeat the question: we solve flows over elementary shapes. Going back to the simple example in the previous video, there I had historical data of θb, X and M, and I gave you θ as the output. Similarly, here I have solved for various flows and I have got the full flow fields; now I want to know, if I have a new flow or a new body, what is going to be the flow?

So I leave you with this question. Please think about: what would be the appropriate input? What would be the appropriate output for this problem? What would be the architecture that we would use? What would be the loss function? These are important design choices within deep learning. So I leave you with this question, and I will come back with the solution, as people have solved it, in the next video. Thank you.

Machine Learning for Engineering and Science Applications
Professor Dr. Balaji Srinivasan
Department of Mechanical Engineering
Indian Institute of Technology, Madras
Application 2 solution
(Refer Slide Time: 0:16)

Welcome back. In the previous video I set up the problem of trying to find a surrogate model for computational fluid dynamics. In this video I will talk about one very recent solution; in fact, this is a very good paper, from 2016, created by people at Autodesk. Normally I would have liked to show you the full paper, which is available online, legally; I am just going to describe results from it. There are people within my lab also working on similar problems, but this is the original paper. I have taken sections from it instead of showing you the full paper, because it is a little more convenient for me to show a few sections.

So please do read the full paper; it is very well written and the results are actually really impressive. Okay, so what they used, as you can see in the title, was CNNs. The first trick that they used, which I would like you to think about seriously since CNNs have become extremely powerful over the last five or six years, is the fact that any field (this is the principle) can be interpreted as an image.

(Refer Slide Time: 1:45)

So, for the flow past a cylinder that I showed you in the last video: you can interpret it, if you just look at the contours shown there, as an image; even the whole thing right now appears to your eyes as an image. To me, as a person within CFD, these are velocities and pressures, etc, but to you, in case you are not familiar with CFD, it is just an image. So we are going to exploit, or rather these researchers exploited, this fact that what looks like physics can be interpreted either as data or as an image.

Now, before we proceed, please remember that this interpretation is a little easier to apply in case you have what I will call homogeneous data: when I see a pressure contour here, the colors that you see represent just one physical quantity; it is not as if this portion is pressure, this color is velocity, and this color is something else. When you have heterogeneous data, and it is not field data, it is usually more useful to use simple ANNs; CNNs are useful when you have images. Why is an image useful? Remember, the basic idea of a CNN is that spatial closeness actually matters, and we want to account for that within our weights, or within our kernels, or within our filters, okay.

(Refer Slide Time: 3:14)

So the idea these people used was that a section of the flow is simply taken as an image; you will see the size of the image later on when I show you the architecture. They took one section of the whole flow: the flow can actually be quite big, there is one portion that you are interested in, you just cut that out and take it as an image. Now, this idea is of course very general; I will show you another iteration of it when we discuss problem 3, but this idea of interpreting a field as an image is actually a very straightforward one, okay.

Now, when we take an image: for the sake of this particular paper, they took two images, one of the component ux and another of the component uy, both of which you can just take as images, okay. Now, these are our outputs; so what we are interested in as the output is actually the velocity field of the flow, okay.

Now, what is the input? This is a little more interesting, okay. We know that what causes the flow, in this case, is the shape of the body; but "the shape" is a little vague. If I change this shape to, let us say, a square, then as a fluid mechanics person I know that I am going to get a different flow; you might know this, or you might even see it intuitively, even if you are not within fluid mechanics. So the shape is what determines the boundary condition, and the boundary condition is what finally determines the flow. So if I change the shape, the flow changes; so that is the input. But how do I give it?

So we know that this is supposed to be the shape, but how do we give it? We have several options; let me give you a couple of options that the researchers in this paper themselves considered. One thing they had decided is that they were going to use CNNs, okay, so that much is decided. Once you have decided that you are using pure CNNs, you have to give an image as input; that is also determined. The special thing here is that we have an image as output also: it is not a classification, it is actually a full image, just like you saw in segmentation tasks, okay.

But the input is also an image, so how do we give an image as input? Let us say I want to give the image of a circle as input; there are a few ways of doing it. One very simple way is to say: let us say this is 256 x 128 pixels, and I will label each pixel as 0 if it is inside the body and as 1 if it is outside the body, okay. So is this clear? 0 if inside the body, 1 if outside the body. This is one possible input.

Now, as it happened, when the researchers did this, the results were kind of okay but not so good. So remember, when I have this kind of problem, what I am saying is: for a specific input, I am going to somehow map it, using some CNN architecture, to an output; image to image, okay. I am not going to talk about the architecture right now; I will talk about it in a few minutes.

But somehow, if you are able to map this input to an output, they saw that if you label the input with 0's and 1's, it does not work very well. Now, why does it not work very well? Here is where your domain knowledge comes into play, okay. We know from fluid mechanics that it matters how close you are to the body; perhaps you can see this even if you do not know fluid mechanics: the further and further away I am from the body, the less and less it will matter what the shape of the body is, okay.

Similarly, if I am close to the body and I label that pixel or location 1, and I am far away from the body and I still label that pixel or location 1, then it seems like I am completely throwing away some portion of the physics of the problem: I am saying that this portion of the flow is as important as that portion of the flow, as far as the shape of the body is concerned, which is not true.

(Refer Slide Time: 8:30)

So these people came up with a slightly modified idea, an intelligent idea, called a signed distance function. What is a signed distance function? It works as follows: inside the body it is negative, outside the body it is positive, and its magnitude is the closest distance from the point to the body. So suppose you are at this place: they find the closest distance from there to the body, and if the point is outside they label it positive, and if it is inside they label it negative. This way you have something that accounts for the fact that points close to the body are actually more important than points further away from the body, as far as the boundary conditions are concerned, okay. Let me show you a contour for this.
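As a small aside, here is a minimal sketch, assumed for illustration and not the paper's code, of computing such a signed distance function for a circular body on a 256 x 128 grid:

    import numpy as np

    H, W = 128, 256
    yy, xx = np.mgrid[0:H, 0:W]
    cx, cy, r = 64.0, 64.0, 20.0     # hypothetical circle centre and radius
    dist_to_centre = np.sqrt((xx - cx) ** 2 + (yy - cy) ** 2)
    sdf = dist_to_centre - r         # negative inside, positive outside, zero on the surface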

(Refer Slide Time: 9:36)

So you can see here, once again, they have drawn this signed distance function exactly for a circle, as I have described. You can see that zero distance is dark blue, so somewhere here is where the body is; further and further away it is more and more positive, and inside it is negative. You can see the body; they have actually shown the outline, this is the body, okay. So the body is here, and this is how they have given the signed distance function.

Now, for any object you can give a new signed distance function: let us say it is an airfoil, once again negative here, positive here; you find the shortest distance from any point to the surface and then label it positive or negative. So this is how they have given the input, and once again, here is where some amount of domain knowledge has been used, okay. Now that we have decided the input and the output, we need to decide two more things: the loss function, and finally, in some sense most importantly, the architecture: what map, what model are you going to use to map this image to the image of the flow?

Okay, so let us first discuss the loss function. The loss function is very simple: for each flow I have my ground truth (this is of course on the training set), and the ground truth is the CFD solution. They used something called LBM (the Lattice Boltzmann Method) in order to generate lots of flows, so they found the CFD solutions, and you have your prediction; as usual, let us call the ground truth y and the prediction ŷ.

For the loss function you have several choices; I will just mention one that you could use:

L = Σ (in/out) × (y − ŷ)²

Of course, this is our least-squares loss function, except with a small change: we multiply it by an indicator, which I will just call "in or out". We obviously know that the flow within the body, even if the CNN gives a prediction there, is actually useless.

Here, inside the body, we have no flow at all. So even though, since you are interpreting the whole thing as an image, you are going to get a loss inside the body, that part we will not count, okay: "in" means we will use 0, so we do not count the loss, and "out" means we will use 1, so that only the loss outside the body is calculated.
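As a small sketch, an illustration of the idea rather than the paper's implementation, the masked least-squares loss can be written as:

    import numpy as np

    def masked_l2_loss(y_true, y_pred, sdf):
        mask = (sdf > 0).astype(y_true.dtype)   # 1 outside the body, 0 inside
        return np.sum(mask * (y_true - y_pred) ** 2) / np.sum(mask)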

Another way to say it is: if the signed distance function is negative, you assume that there is no loss, and if the signed distance function is positive, we actually calculate the loss and incorporate it into the algorithm, okay. Now, the last step is to find out what the architecture is that maps input to output, so let us see that. I am going to give a rough architecture now, and then I will show you the architecture that they used. Remember, we have an image as input; this image is the signed distance function, we call it the SDF, and it is a 256 x 128 image. We also have images as output, again 256 x 128; there are two images that you have to give, one of which will be the x velocity field and one the y velocity field, okay.

Now, once again, it is our domain knowledge that tells us that there are two effects at play: one is the effect of the physics, which is the PDEs, and the other is the effect of the geometry, which is given by the boundary conditions.

(Refer Slide Time: 14:24)

So here is what we will do; remember, I want to go from an image to an image. Typically, and this is very similar to what you did in segmentation tasks, we use what is known as an encoder-decoder structure; you would have seen this several times during the CNN lectures. An encoder-decoder structure does the following: it takes an image, squeezes it into some essential information, and then expands that essential information. It is sort of like zipping a file and then unpacking it: you remove the unnecessary stuff, pack it into something, and then expand it.

Now what it does is essentially only the important pattern stay and the other stuff goes out, we
will do the same thing encoder, decoder back into the full image, okay. Now an important thing
is should you use two encoders and two decoders or one encoder, two decoders, etc because
remember we are trying to predict two things and here is where our physics knowledge comes
into play once again, we know that when I am encoding what I am encoding is effectively the
geometry, this is not strictly speaking true, but it is true enough that I will just leave it like this, is
what I am saying is regardless of whether it is the U velocity or the V velocity your geometry has
to be squeezed.

So, therefore, the encoder can be common, because it only pertains to the geometry and not to the particular velocity field. The decoder actually tries to incorporate the physics, and then you can split it into u_x and u_y: two separate decoders are used, each with the corresponding loss function, okay.
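As a rough sketch of this idea, here is what a shared encoder with two decoder heads might look like in Keras; the layer sizes here are placeholders of my own choosing, not the exact architecture of the paper:

import tensorflow as tf
from tensorflow.keras import layers

# Shared encoder: squeeze the 256x128 SDF image into a compact code
inp = layers.Input(shape=(256, 128, 1))
h = layers.Conv2D(64, 4, strides=4, activation="relu")(inp)
h = layers.Conv2D(128, 4, strides=4, activation="relu")(h)
code = layers.Dense(1024, activation="relu")(layers.Flatten()(h))

def decoder(code, name):
    # Each head expands the shared code back to a full 256x128 field
    d = layers.Dense(16 * 8 * 32, activation="relu")(code)
    d = layers.Reshape((16, 8, 32))(d)
    d = layers.Conv2DTranspose(32, 4, strides=4, activation="relu")(d)
    return layers.Conv2DTranspose(1, 4, strides=4, name=name)(d)

model = tf.keras.Model(inp, [decoder(code, "u_x"), decoder(code, "u_y")])
model.compile(optimizer="adam", loss="mse")  # one least-squares loss per output head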

(Refer Slide Time: 16:20)

So let me actually show you the structure that they used in their paper; here is the CNN architecture used in the paper. Notice what is happening here: this portion, until here, is the encoder. It took a 256×128 image, squeezed it, and got a fully connected layer of 1024. This portion is the decoder portion, and it has been split into two sections, one for u_x and one for u_y, okay.

Specifically, also see that you have 128 filters in the very first layer, each of size 16×8, which is why the 256×128 input gets squeezed into 16×16×128. As a simple exercise, please think about what the stride is; the stride, as you will see, is actually 16×8. So what the layer is doing is taking a block at a time: it is not simply sliding the kernel, it jumps across, then jumps down. I will leave this to you as a simple exercise; Dr. Ganapathy discussed strides in great detail. This is done just so as to avoid too many filters and too much computation, okay.
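If you want to check this stride arithmetic yourself, a small TensorFlow experiment (my own sketch, not the authors' code) confirms the shape:

import tensorflow as tf

x = tf.zeros((1, 256, 128, 1))                    # one SDF image
y = tf.keras.layers.Conv2D(128, kernel_size=(16, 8),
                           strides=(16, 8))(x)    # the kernel jumps block to block
print(y.shape)                                    # (1, 16, 16, 128)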

So this is a simple structure. Similarly, the next step is to use 512 filters of size 4×4, and finally there is a fully connected portion. Like I said, it is not quite accurate to think of this as squeezing only the geometry; the reason it is not quite accurate is that we are actually doing backprop throughout, so some effect of u_x and u_y is going to flow through as well. Okay, after this they have a decoder structure with transposed convolutions: you sort of expand this back, right back to the original size of 256×128. You would have seen this several times during the segmentation tasks.

Finally, we take the loss function: you get the x component from CFD, you get the y component from CFD, and then you train this, okay. Once you do this, the results are actually quite surprising, and I would like you to see these results right now. In this work by Guo and co-workers, what they did was use simple shapes of the kind I had described earlier, for which CFD tends to work a little bit faster for several technical reasons. They used circles, squares, triangles, etc., some polygonal shapes, oriented them at different angles, and found out the various flow fields; this is for training. For testing, they actually gave far more complex shapes, and what is remarkable is that, unlike our usual CNN examples where training and testing in some sense look similar, they tested on far more complex shapes.

If you know CFD, you know that the mapping from here to here is actually quite hard; you cannot really say anything obvious about it. So cars, wings, etc.

(Refer Slide Time: 20:05)

Now I am going to show you their results; again, I would refer you back to the paper. It is a very good paper, available for free on the web for academics as well as non-academics; please search for it, okay. So here you have their results. What is written as LBM is essentially the CFD result, which we can take as ground truth, and here is the prediction from the CNN. You can see that, picture to picture, the match is actually remarkably good; their error was around 1 to 2 percent.

So you can see the error drawn here. Now, as people in CFD, we know that these things can be somewhat worrying: there is some error near the boundary, which is actually important for people within CFD. But nonetheless, overall this is a remarkably good prediction, and the speedup is quite huge, of the order of a few hundreds; depending on whether they used CPUs or GPUs, their speedup rates were quite different.

So this is actually an excellent surrogate model. Why is that? Because you can get to within about 2 percent of the answer, at least over the whole flow field, about 200 times faster. So you can imagine, if your design cycle is predicated mostly on CFD, it is not always, but let us say 60 to 70 percent is actually dependent on CFD, you can get a speedup of a factor of 10 or 15 in your whole design cycle: something that takes a year can take about 20 days.

So you can see that just by using CNNs you can get really rapid predictions. Of course, this is for some limited cases; we are not there yet in terms of full 3D and all sorts of complex flows, but this is a great beginning. So I would refer you all to this original paper by Guo et al., whether you are interested in CFD or not. Of course, I spent a long, almost interminable amount of time talking about CFD, and I will only briefly introduce the next two problems because they are somewhat related. But in general, CAE problems, CAE being computer-aided engineering, can gain tremendously: people even within chip manufacturing companies all do some computation or the other, and CNNs have tremendous applications there. I expect more and more papers; indeed, many papers have come out over the last year, this paper was 2016, and in 2018-19 many more have come out. By the time we finish this course and finish this year, you will probably see the literature flooded with such examples. I encourage you to just look up "CNN application to problem x", where x is any field problem. Thank you.

Machine Learning for Engineering and Science Applications
Professor Dr. Balaji Srinivasan
Department of Mechanical Engineering
Indian Institute of Technology, Madras
Application 3 Description: Topology Optimization
(Refer Slide Time: 0:15)

Welcome back. I would now like to describe the third problem, which is topology optimization; I will spend only a brief time describing it. What I will show you in the next video is work done by a couple of students at IIT-M, Harish and Sai Kumar; I will show you their results shortly. But once again, in the spirit of these lectures, I would like you to think about how to pose this problem.

You will see that there are a lot of similarities between this problem and the CFD problem that I discussed earlier, which is why the students did it, okay. So in this case what we are trying to do is what is called topology optimization: you try to redistribute the material of a structure subject to certain design constraints. A design constraint could be, say, that I want to use only so much material, okay.

So suppose you have a particular shape and a particular loading condition. For example, you have a building with certain beams supporting it, and you want to use as little material as possible; that is your constraint. And obviously you will have constraints such as it should not break, so you do not use such weak or such sparse material that it actually breaks.

In case you know the solid mechanics of the problem, you can use it to redistribute the material. I will not get into the details, nor will I get into the equations in this case; that is too far beyond the requisite amount of knowledge for this course. So what we will be doing is take a shape of this sort and, in a sense, remove material and redistribute it optimally, okay. The usual method for doing this uses some eigenfrequency maximization, etc.

So we will look at one very, very simple problem, and I would like you to think about how you would do it.

(Refer Slide Time: 2:15)

So the problem is as follows: we have something called a cantilever beam, okay. A cantilever beam, in case you do not know, is simply a beam that is fixed to the wall at one end and free at the other end, where a force can push it down. You can see that structure here: here is the wall, here is the beam, and you are pushing it down. Now what you would like to know is, given that I am able to use only a certain percentage of this mass, what is the optimal way for me to distribute this mass so that the beam does not break? That is the problem that we are going to consider, okay.

Now, our constraints are as follows: I am going to give you the volume fraction, that is, what percentage of the volume you can use, okay. So if I give you 10 kg of material and say that 70 percent is what you can use, you will have to make do, throwing away 3 kg of material in some way, so that the structure still does not break, okay. Another variable here is something called Poisson's ratio. To describe it very, very simply: if I squeeze something in one direction, you know that it will move in the other direction also.

So the amount that it moves in the other direction is, in some very loose sense, Poisson's ratio: the material will not just compress, it will actually move in the other directions also. So strain in one direction versus strain in the other direction is Poisson's ratio, okay. Given these constraints, how would you pose this problem? What would be the input for this problem? What would be the appropriate output? And what would be the network architecture? Please think about it, and I will show you what Harish and Sai did in the next video. Thank you.

Machine Learning for Engineering and Science Applications
Professor Dr. Balaji Srinivasan
Department of Mechanical Engineering
Indian Institute of Technology, Madras
Application 3 Solution
(Refer Slide Time: 0:14)

Welcome back. I hope you thought about the problem and noticed that it is very similar to the fluid mechanics problem that we did in the last video, with one or two notable differences. The thing that is similar, of course, is the inputs, though you do not have the signed distance function or anything like that as input; you can simply give the shape of the beam as input, and in order to train you can give some optimized structure as the output.

As it turns out, the data set was generated by something called the 99-line code by Sigmund; this is available in MATLAB, and this is what Harish and Sai used to create the data set, okay. So let us assume that you have this data set. Once you have it, how are you going to give the input? Remember that the input has two other constraints as well: one is the volume fraction and the other is Poisson's ratio, so you want to specify both of these as input, okay.

Now, for a CNN it is ideal if we use images themselves as input and images as output. There are multiple other structures possible for this problem, but I will show you just one, which takes an image as input and gives an image as output, okay.

(Refer Slide Time: 1:53)

So this work, as I said, is by Harish and Sai Kumar; let us see what they used as input and output.

(Refer Slide Time: 2:01)

So here we see what they used as their input. What they did cleverly was to use the volume fraction itself as an image. For example, if your volume fraction is 80 percent, you have an image which is 100 by 100 and about 80 percent of it is filled, so that in some sense, over iterations, the algorithm learns to use only this much yellow and redistributes it within the figure so as to get the final topology-optimized structure.

The second thing was to use μ (Poisson's ratio), which is also a ratio, a decimal number, as an image in the same form, okay. So both of these were given as two channels of input; remember that when we were doing CNNs we had multiple input channels, RGB. Similarly, in this case, Harish and Sai used two channels, one for the volume fraction and one for Poisson's ratio, and they gave the optimized structure as the output; once again you train on multiple outputs. I am going to show you the network architecture that they used; it is very similar to the structure that you saw for CFD.
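A minimal sketch of how such a two-channel input could be built (my own illustration; the exact encoding Harish and Sai used may differ):

import numpy as np

def make_input(vol_frac, nu, size=100):
    # Channel 1: roughly vol_frac of the pixels set to 1, the rest 0
    vf = np.zeros((size, size), dtype=np.float32)
    vf.ravel()[:int(round(vol_frac * size * size))] = 1.0
    # Channel 2: Poisson's ratio as a constant image
    nu_ch = np.full((size, size), nu, dtype=np.float32)
    return np.stack([vf, nu_ch], axis=-1)   # shape (size, size, 2)

x = make_input(0.45, 0.2)   # the case shown in the results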

(Refer Slide Time: 3:28)

So here is the structure that they used. Here is the input; once again an encoder-decoder structure, and you will see that it is fairly similar to what we saw for CFD: you systematically come down here, then systematically build it back up to the target. The loss function is of course the least-squares loss function; you can use other loss functions also, but that is what they used.

So you can use this; you can also treat it as a segmentation task for (())(4:06), and then they trained it, okay. You will see that they actually had pretty decent success with this kind of architecture.

(Refer Slide Time: 4:18)

Now, here are a few results for this problem as obtained by Harish and Sai. You will see that for a volume fraction of 45 percent and a Poisson's ratio of 0.2, the left side is the ground truth, so you can see that 45 percent of the material has been redistributed in various ways, and the right-hand side is our CNN solution. Once again you can see that, visually, it has succeeded quite well; of course you can never manufacture material of this sort, you know, it has removed some arbitrary small points here and there.

But as an initial cut at what the shape will look like, this is actually remarkably good. You will see some differences here, of course, but other than that this is a remarkably good solution. So this problem also shows you that you can take problems that are fairly complex, at least physically. I do not think any of us can intuitively say where the distribution of these gaps should be just by being told that 45 percent of the material should remain. You can see that which portion should be empty and which portion should be full has been predicted really well by this CNN structure, showing how powerful it can be even when you map images to images.

Towards the end of this week I will actually discuss some other examples of how you can combine CNNs with LSTMs; I will just briefly describe them and leave it to you to read the original papers.

So once you have this powerful idea of using CNNs for field data, you can combine it with various other architectures, CNNs plus ANNs, CNNs plus LSTMs, to do all sorts of problems. I will discuss this very briefly, not in as much detail as I have done now; towards the end of this week I will have a very short section on what else can be done with CNNs.

In the next video I will discuss one final problem, which will be the final application for this week. In all the cases so far you had to be given examples; the final application is going to be how we can actually solve ODEs and PDEs from scratch, without any examples. We will start with that in the next video. Thank you.

Machine Learning for Engineering and Science Applications
Professor Dr. Balaji Srinivasan
Department of Mechanical Engineering
Indian Institute of Technology, Madras
Application 4 Solution of PDE/ODE using Neural Networks
(Refer Slide Time: 0:16)

Welcome back. In this final application for the week we will be looking at something different from all three examples that we saw before: the solution of differential equations using neural networks. As you know, in almost all of engineering and science most of us encounter differential equations in one form or the other. These differential equations can be either ODEs (ordinary differential equations) or PDEs (partial differential equations).

Now here is the question: can neural networks actually be used to solve ODEs and PDEs? Unlike the other three applications, I will not make a separate video just to describe the problem and then ask you to think about it, because honestly this is a very clever idea, and it is unlikely that one would think of it on one's own. In fact, it will look very different from almost everything else that we have discussed in this course, because you will not clearly see a training set, a testing set, or a validation set, etc.; so please do remember this as I go over this example.

The idea behind this example goes back to a set of papers by Lagaris et al. from the late 90s and early 2000s; if you just search for "Lagaris neural networks" you should find these publications. The application that I will be discussing right now is by a set of researchers from Brown, Raissi et al.; as you will see, I will show you the paper shortly, and it was published just last year, so this is extremely recent. The reasons for including this are (a) of course, ODEs and PDEs affect us everywhere within engineering and science, and (b) it is very different from the other applications that we have seen so far; in fact, it is not a surrogate model, okay.

Remember, in the CFD video I showed you how differential equations are actually discretized: using the finite difference method we make an approximate equation. In this case we are not doing that; we are using an entirely different method, posing every ODE or PDE as an optimization problem, and you will see this extremely clever way of doing it, okay. Like I said, the original idea was by these researchers, and I would request you to look at it in case you are interested in this field; I do think this is going to become very important. Before I describe the problem, I would like to talk a little bit about why it is important.

Remember that when we used the finite difference method, or whatever method we used to generate the CFD solutions, this was still done in the usual way. In this case, using neural networks, you can perhaps automate the whole thing: the neural network can both solve the equation and learn from the solution. So this is moving towards full automation of solving differential equations using neural networks; first you solve the equation using a neural network, and then you can make a surrogate neural network for this original neural network. This can look a little bit confusing, but I do think that this will slowly start seeping into the mainstream of a lot of CAE solvers, and that is the reason for discussing it.

So let us take a very simple differential equation. Let us say I have the differential equation d²T/dx²; in fact, let me change the variable to something else and call it u:

d²u/dx² + a du/dx = b

Suppose you wish to solve this. You will also be given two boundary conditions, without which you cannot solve it; it would not be a well-posed problem. So let us say x lies between 0 and 1 and you have the two boundary conditions:

u(0) = u₀
u(1) = u₁

Now, you can do this in multiple ways. Of course, you can do it analytically; this equation is solvable analytically, as you would have seen for a second-order linear differential equation. Or you can do it numerically, using the kind of method that I discussed.

(Refer Slide Time: 5:16)

Now there is a third method that uses neural networks. It is a clever method, which is as follows. We assume that u is some neural network that takes x as input and gives u as output. Just to make this clear diagrammatically, the neural network will look like this: x is the input, typically we also have a bias unit, then something happens here, this is the neural network (in case this is unclear, we can simply assume it is a single hidden layer), and after all this you get u as the output.

Now, why is this possible? Regardless of which differential equation it is, we know from the universal approximation theorem that I can always approximate the solution u arbitrarily closely by a neural network; because that is possible, I can always assume that u is some neural network of x, okay. Now, how does that help us? It helps us because suppose I postulate a network, just like we did with the earlier θ, m, b, x example: I take x, let us say I put in 10 neurons, and here is u, okay.

Whenever we do this, we are actually writing a full functional form for u. How is that so? Let me take a simple example: let us say x is the input and, for now, I am going to forget the bias unit. Let us say there is one hidden neuron a₁; this is my simple model, and then I have u, okay. So what this says is that a₁ is, of course, the sigmoid of some w₁x, and, taking the output layer to be linear with the sigmoid in the hidden layer, u is some other weight w₂ times the sigmoid of w₁x:

a₁ = σ(w₁x)

u = w₂ σ(w₁x)

Now suppose I want du/dx. I can actually calculate this analytically; I can write it as w₂ times σ′(w₁x) times w₁:

du/dx = w₂ σ′(w₁x) w₁

Similarly, you can calculate d²u/dx², etc. This is the key point: all derivatives of u with respect to the input x can be found.

Now you might think that this is possible only because I had a single hidden neuron, what if I
had 10 hidden neurons or worse still what if I had multiple layers of hidden neurons?

Okay, even if you have multiple layers of hidden neurons, through backprop you can always find the derivative of the output with respect to the input, just like we found the derivative of the loss function with respect to the weights. You can find it using the same backprop idea; this is called automatic differentiation. In fact, TensorFlow has inbuilt functions that do this; I will show you the function from the paper, so you can see how they actually did it.
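Before looking at their code, here is a tiny self-contained check of this idea, with hypothetical weights w1, w2 for the single-neuron model above (using TensorFlow's eager GradientTape here; the paper's snippet shown later uses the older tf.gradients):

import tensorflow as tf

w1, w2 = 0.7, 1.3                  # hypothetical fixed weights
x = tf.Variable(0.5)

with tf.GradientTape() as tape:
    u = w2 * tf.sigmoid(w1 * x)    # u = w2 * sigma(w1 * x)
du_dx = tape.gradient(u, x)        # automatic differentiation

# analytic check: du/dx = w2 * sigma'(w1 x) * w1, with sigma' = sigma(1 - sigma)
s = tf.sigmoid(w1 * x)
print(float(du_dx), float(w2 * s * (1 - s) * w1))   # the two values match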

(Refer Slide Time: 9:40)

So the point is this: the moment you specify a neural network, if u is a given neural network of x, we can find u′(x), which is du/dx, u″(x), and so on:

u = NN(x)  ⟹  u′(x), u″(x), …

Now, this makes things easy for us. Why? Because instead of saying I am solving this equation, I will pose the problem as follows. Since this is an approximation, we will call it û, just like we did with y and ŷ: the original is u, and the neural network predicts û. So what û do we want? We want û to satisfy d²û/dx² + a dû/dx − b = 0. But obviously it is not going to be exactly 0, so what do I do? I square it and say minimize:

minimize (d²û/dx² + a dû/dx − b)²

This is an extremely clever posing of the problem: instead of saying I solve d²û/dx² + a dû/dx − b = 0, I say minimize this square, and obviously the actual minimum is attained only at the exact solution, because there the expression would be 0. What you will get in practice, of course, is something close to 0 but not exactly 0, because our neural network will in general not give the exact solution, okay.

Now, this tells you the cost function that you have to minimize, and the moment you put in the neural network, it can be evaluated for a given set of weights. Let us say you initialize with some weights, say w₁ and w₂ in the example that I gave: you put them in, do forward prop through the neural network, calculate each of the terms, and then try to minimize, doing gradient descent; this is our new cost function.

But if you have been paying attention so far, you will notice that this is not sufficient, because we have the boundary conditions also. What do we do about them? The term above will just satisfy the ODE, but it will not satisfy the boundary conditions. It turns out that is straightforward also: all you need to do is add them to the cost. How do you add them? Remember that if I give 0 as input to this neural network, it will produce û(0); so I add (û(0) − u₀)², where u₀ is the exact value, plus (û(1) − u₁)². So the total, all put together, is our loss function:

Loss = minimize [ (d²û/dx² + a dû/dx − b)² + (û(0) − u₀)² + (û(1) − u₁)² ]

Please think about this; it is an extremely clever posing of the problem, in which the differential equation and the boundary conditions have all been put together as a single optimization problem. So the ODE has now been converted to an optimization problem, and after this it is simply a matter of solving it. Okay, so how will you solve it? You try various values of x between 0 and 1: as the problem is posed, for each x, say x = 0.1, you run the network and that gives you a residual, a loss; x = 0.2 gives you a loss, and so on and so forth.

So let us say we put in 10 points; at these 10 points I will calculate how much the loss is and try to minimize it, and that is it. There is no training set, really; if you wish, you can call these arbitrary x points the training set, but this is not supervised in any way, because I am not giving you a label. All I am telling you is: this is the function to be minimized. I can automatically find the values of the differential equation's residual at those points, add them together, and try to minimize the sum, okay. So this is the formulation of the problem. You can obviously do this for PDEs, ODEs, anything; it is a fantastically universal method of solution. I will now show you the solutions that Raissi et al., the authors of this paper, found for a couple of problems, so let us see those.
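For the simple ODE above, the whole formulation fits in a short script; this is a minimal sketch under assumed values of a, b, u₀, u₁ and an assumed small network, not the authors' code:

import tensorflow as tf

# Hypothetical problem: u'' + a u' = b on [0, 1], u(0) = u0, u(1) = u1
a, b, u0, u1 = 1.0, 2.0, 0.0, 1.0

model = tf.keras.Sequential([
    tf.keras.layers.Dense(10, activation="tanh", input_shape=(1,)),
    tf.keras.layers.Dense(1),
])

x = tf.constant([[i / 10.0] for i in range(11)])   # collocation points
x0, x1 = tf.constant([[0.0]]), tf.constant([[1.0]])

def loss():
    with tf.GradientTape() as t2:
        t2.watch(x)
        with tf.GradientTape() as t1:
            t1.watch(x)
            u = model(x)
        u_x = t1.gradient(u, x)          # du/dx at the collocation points
    u_xx = t2.gradient(u_x, x)           # d2u/dx2
    residual = u_xx + a * u_x - b        # ~0 for the exact solution
    bc = tf.square(model(x0) - u0) + tf.square(model(x1) - u1)
    return tf.reduce_mean(tf.square(residual)) + tf.reduce_sum(bc)

opt = tf.keras.optimizers.Adam(1e-3)
for _ in range(5000):                    # gradient descent on the weights
    opt.minimize(loss, model.trainable_variables)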

(Refer Slide Time: 14:46)

Okay, so here is the link to the paper. It is an arXiv preprint; it has now also been published in the Journal of Computational Physics, I think this year, 2019, but you can search for it. They call this physics-informed deep learning, or physics-informed neural networks, PINN. It is physics informed because, unlike all the previous cases that we saw, remember when we were doing the computational fluid dynamics solution of a car, etc., the network had no knowledge about the physics; only the training set had knowledge about the physics, and my deployment was simply a CNN that took the input as an image.

Here, the neural network is trying to impose the differential equation, and that is why it is physics informed; the differential equation obviously comes from physics, okay. So this is the paper; I would highly recommend that you read it. It is extremely well written, a very clear paper. They have put their code up on GitHub, the links for the code are here, and the code actually works as advertised; there is a Jupyter notebook you can just open and run. I will show you some outputs from their paper, but I would highly recommend that you go back to the original paper and take a look at it. The arXiv link is given in case some of you do not have access to JCP; they have very kindly put up their original paper on arXiv as well.

Okay, so here is the equation that they have tried to solve in the paper. This is, of course, a partial differential equation, known as Burgers' equation. You will see u_t + u u_x − μ u_xx; this is what is called the viscous Burgers equation.

(Refer Slide Time: 16:48)

Now, unlike simple ODEs, you have to give a little bit more here, because this equation has both x and time. What happens is that at some initial time t = 0, you know what the function looks like, okay. In this case, they have said that at time t = 0 the function looks like a sine wave, and the question is: what does it look like at a later time? This is called marching; you start from t = 0 and slowly move forward in time, and they want the solution from t = 0 to t = 1, okay.

The postulate is as follows: u(x, t) is a neural network that takes x and t as input and gives u as output. So x and t go into the neural network, and u comes out. If I remember correctly, they tried various numbers of layers and neurons; for this problem I think it is 9 layers with 20 neurons each, but please do refer to the paper, I might be wrong about this number, okay.

So, about the conditions they have given: this one is what is called the initial condition. Apart from that, you have to say what happens at the boundaries as you move forward in time; this axis is x, remember. What they have said is that the boundaries are fixed at 0: x goes from −1 to 1, and at both boundaries u is always fixed at 0, and we want to see how the flow solution evolves. Physically we have some idea, but I am not going to discuss it, because very few people watching this video would have direct physical knowledge of the equation, okay.

Now, going back to the original idea: if I pose it this way, then my loss function is as follows. I will take u_t, where u is now a neural network; remember it takes x and t as input, so I can always find the derivative of u with respect to t using backprop, and similarly u with respect to x. If I want u_xx, I do backprop once and then once more. When I say backprop, that is a slight abuse of notation; it is actually what is called autograd, or automatic differentiation, but it works very similarly to how our backprop works, okay.

So û_t + û û_x − μ û_xx (let me call the coefficient μ, because it is just a constant) has to be 0; but instead I will square it, and of course I will use û now. The other condition is that at t = 0 the function should become −sin(πx); again, û might not satisfy this exactly, so I add (û(x, 0) + sin(πx))². Similarly, I have one more condition here: u at x = −1 should be 0, and u at x = 1 should also be 0. All of these added together give me my loss function:

Loss = (û_t + û û_x − μ û_xx)² + (û(x, 0) + sin(πx))² + û(−1, t)² + û(1, t)²

So every guess that you have for the weights automatically postulates some connection between x, t and u; when you differentiate that and add these conditions, you minimize the total loss, which gives you gradient descent for the weights, okay. It is a very clever way of implementing it, and that is basically what they have written in their paper; this section is right from there. Instead of calling it û, they have called it f: f is the residual, what remains when you do this calculation. If you want to be consistent with our notation, you can think of it in terms of û, okay.

They have given a Python code snippet in their paper, and they have also put their full code online, so this is actually a good problem to start with; some of you might find it very interesting, especially if you have ever worked with differential equations. The code is extremely well written and very clear, and that is one more reason I recommend it, okay.

(Refer Slide Time: 22:14)

import numpy as np
import tensorflow as tf  # TF 1.x style, as in the paper

# neural_net, weights and biases are defined elsewhere in the authors' code
def net_u(t, x):
    u = neural_net(tf.concat([t, x], 1), weights, biases)
    return u

def net_f(t, x):
    u = net_u(t, x)                       # renamed so the function is not shadowed
    u_t = tf.gradients(u, t)[0]           # du/dt via automatic differentiation
    u_x = tf.gradients(u, x)[0]           # du/dx
    u_xx = tf.gradients(u_x, x)[0]        # d2u/dx2: gradient of a gradient
    f = u_t + u*u_x - (0.01/np.pi)*u_xx   # residual of the viscous Burgers equation
    return f

So here are just some sections of their code: the definition of the neural network and how they found the gradients. You see here, this is ∂u/∂t, this is ∂u/∂x, and u_xx is a gradient of a gradient; all of this is automatically available within TensorFlow, okay.

(Refer Slide Time: 22:40)

So once you do that, here are the solutions; here is the contour of u, okay. Here you can see where they have given the initial condition, and these are the boundary conditions at −1 and 1. Apart from this, remember the ODE example that I gave: you have to take lots of points in between; these are called collocation points, the points where you want to make sure that the differential equation is satisfied.

Now, one other point of detail: you want to make sure that you take the mean loss, okay. Obviously, if you have lots and lots of points in the interior and only a few points on the boundary, you will get a lot of loss from inside the domain and only very little from the boundary. To avoid that, what they have done is take the mean of the losses over each of these groups of points separately; once you do that, it is a little better balanced. There are still some questions left; for example, people in my group are working on small problems that still remain within this PINN approach, and we are trying to handle them. But apart from that, it is a very well-written code and a very intelligent scheme, okay.
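In code, this balancing is just a separate mean per group; here is a small sketch with hypothetical tensor names of my own (f_pred for the residual at the collocation points, u0_pred/u0_true at t = 0, ub_pred at the two boundaries):

import tensorflow as tf

def pinn_loss(f_pred, u0_pred, u0_true, ub_pred):
    # Average each group separately so the interior points do not swamp the rest
    return (tf.reduce_mean(tf.square(f_pred))               # collocation residual
            + tf.reduce_mean(tf.square(u0_pred - u0_true))  # initial condition
            + tf.reduce_mean(tf.square(ub_pred)))           # boundaries, where u = 0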

(Refer Slide Time: 24:12)

So once you do that, you see the solution here. Notice that blue is the exact solution, obtained from the kind of finite difference method that I told you about; actually, in this case we even know the analytical solution, more or less. And here is the prediction: this performs extremely well; the code that they have given performs extremely well. It turns out that finite differences will actually have some trouble with portions of this problem, while PINN tends to handle it extremely well, without any trouble at all. So this is a very impressive performance. They have gone on to do solutions of Navier-Stokes and some inverse problems; unfortunately, we did not have time to discuss those. Inverse problems are a very big use that you can put neural networks to.

(Refer Slide Time: 24:58)

Apart from this, they have also looked at the Schrödinger equation; within physics, Schrödinger equation solutions are required in quantum mechanics. So this is a solution of a Schrödinger equation; here the solution was split into real and imaginary parts, etc. Again, this is an initial value problem in x and t, very similar to before, and you can see once again that the match between the exact solution and the predicted solution is extremely good.

So, in summary, this is a very novel application of neural networks, one should say. It does not really have a clear training set or testing set; there is no clear supervision of data at all, in fact no supervision is really required. You just give the differential equation, and the loss term is figured out from there rather than from a labelled ŷ; all you do is make sure that your û satisfies the differential equation in the least-squares sense. I expect that the applications of this will grow with time; I know several groups in the world are working on this, and hopefully we will see good developments, with even commercial software using this kind of idea in the future.

So what we have done in the past four videos is four different applications: one a sort of vanilla, very simple neural network application, then two very modern CNN applications within CAE, computer-aided engineering, and the last one, which I expect to grow more and more in the future; actually, I expect the last three applications to grow more and more in the future.

That being said, we did not do several other applications in this course, obviously due to paucity of time and also the kind of medium that we are using; we could do different things if people were here in person and if you had good computational resources.

(Refer Slide Time: 27:08)

One important thing that I talked about in the last video is something called the Conv-LSTM; this is just a change of topic from here. A Conv-LSTM is just sort of an image-based RNN. One of the problems we had given in the exercises was: with a scan, if you have multiple images, a video instead of a simple static image, how can you do an RNN with that?

One way is to take the original image; the final fully connected layer is actually small, and this can go into an RNN, but that is called a CNN-LSTM. A Conv-LSTM is slightly different: wherever you had matrix products, remember Wh + Ux, you change them to convolutions. Once again, the basic idea is the same: you have a sequence of images. One of the first applications of Conv-LSTMs within our field was in weather prediction; this was 2015. The basic idea was to use a sequence of radar images, with which you can in some sense gauge the amount of monsoon rainfall, and try to predict what the radar image will look like in the future; this is supposed to help in monsoon prediction, weather prediction, etc.
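Keras has a ready-made layer for this; here is a minimal sketch (frame size, sequence length and filter counts are placeholders of my own, not from any of the papers mentioned):

import tensorflow as tf

# Predict the next radar-like frame from the previous 10 frames
model = tf.keras.Sequential([
    tf.keras.layers.ConvLSTM2D(16, kernel_size=(3, 3), padding="same",
                               return_sequences=False,   # keep only the last state
                               input_shape=(10, 64, 64, 1)),
    tf.keras.layers.Conv2D(1, (3, 3), padding="same", activation="sigmoid"),
])
model.compile(optimizer="adam", loss="mse")
# model.fit(past_frames, next_frame) with shapes (N, 10, 64, 64, 1) and (N, 64, 64, 1)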

I would encourage you to look up these terms, Conv-LSTM together with weather prediction, and you will find some good links. The results are still preliminary; I know that the Indian Institute of Tropical Meteorology, also called IITM, in Pune is trying this, several institutes within India are trying it, and of course worldwide people are now trying to incorporate CNNs and LSTMs in order to make predictive forecasts about weather, rainfall, etc.

Often you will find CNN-LSTMs or other convolutional LSTMs being used for trying to predict the next frame within a video, but that is not of great interest to us within, let us say, engineering. In engineering we are more interested in the case where you have a sequence of images and you want to know what happens next in terms of practical things like rainfall, etc.

Now, the sky is the limit for the type of applications of all the techniques that we have discussed so far; I will talk briefly about this in the next video. Within this course, unfortunately, we had time for only this much, so I would highly encourage you to look at all the papers that I referred to, see the references within those papers, and also see who has cited those papers later on, in order to get a very wide range of applications. Weather prediction, especially in the earth sciences: I just recently went to a conference and saw the large amount of work being done in this area, okay. So that is it for the applications for this week; I will summarize all that we have done in the course so far in the next video. Thank you.

Machine Learning for Engineering and Science Applications
Professor Dr. Balaji Srinivasan
Department of Mechanical Engineering
Indian Institute of Technology, Madras
Summary and road ahead
(Refer Slide Time: 0:15)

Welcome back. In this final video for the course, I will summarize what we were able to do within the course, what we were not able to do, and how you can move forward in your journey in machine learning.

(Refer Slide Time: 0:32)

So the topics that we covered, apart from the mathematics, were primarily introductory machine learning; this is just sort of a brief taste of machine learning, since we had only about 30 hours, and we also looked at a few applications, okay. So please do not take this course as a thorough course in machine learning; you got an overview. Machine learning is a vast subject, but hopefully you got some idea of what is happening within this field.

The first part of the course was deep learning, and specifically we looked at ANNs, CNNs and RNNs. Once again, all our discussions were somewhat preliminary; we looked at backprop only very briefly, and thorough courses in machine learning will actually go into it. Our point was to give you a flavour of what these things are, why they get into trouble, how they work and how you can play around with them, so that when you want to apply them, and a lot of black-box tools are available nowadays, even TensorFlow, you have some background in what these methods are.

So that is what we covered; deep learning itself was a good solid portion of the course. We also had brief discussions of classical machine learning: binary trees and random forests. Unfortunately, we did not have time to discuss applications of those; Dr. Ganapathy's team itself has done a lot of work on binary trees and random forests within medical imaging, and they have publications there too.

SVMs are classic algorithms; they are slowly going out of fashion, but as has happened throughout history, not only in machine learning but in numerical methods generally, there are things that go out of fashion and then come back. Unfortunately we did not have too much time to discuss SVMs; it is a very interesting algorithm in itself, very different from the kind of algorithms that we saw. We also saw a brief introduction to KNN, and then of course we had unsupervised learning algorithms such as K-means and PCA.

Now, one thing that you can do when you face a practical problem: let us say, just to give you an example, you are looking at weather prediction or monsoon prediction within India, okay. Within India itself, you can see that maybe the north will be different from the west, which will be different from the east, which will be different from the south, but maybe there are other natural regions that aggregate properly.

Instead of trying to predict the whole of the monsoon in the country in one shot, you could probably figure out that there are four different kinds of networks, each of which predicts well in a particular area. Now, how could you figure this out? You take the data and try an unsupervised algorithm on it, okay; when I say data, I am being deliberately vague, this word data is overloaded. But let us say you find out that there genuinely are four clusters, four types of behaviour; maybe K-means can help you there, maybe PCA can help you there. You can think of unsupervised algorithms as a sort of precursor to even supervised algorithms. So that is one place where you can apply unsupervised algorithms.

Naive Bayes, again, we discussed only very briefly, but often, when your data is sparse, by sparse I mean you do not have too large a data set, Naive Bayes can sometimes perform surprisingly well; even though its assumptions are actually unphysical, it tends to perform reasonably well. You also did several other topics that I am not summarizing here. Apart from this, Dr. Ganapathy also covered variational auto-encoders and GANs, which are right now the, quote-unquote, trending topic within machine learning; these are what are known as generative models, and generative models can be useful for many different fields. Unfortunately, once again we did not have time to discuss applications of those; hopefully we will put up some extra videos after this course is over.

We also discussed a few applications in engineering and science, specifically the PDE-based ones that we looked at; hopefully you can apply these ideas to almost any field you work in, at least in terms of basic applications that you can do very quickly using machine learning.

(Refer Slide Time: 5:10)

Now, unfortunately, the topics we did not cover are more numerous than the topics we covered; that is going to be true of any course, actually, since you cannot have both breadth and depth, and we sacrificed depth for breadth within this course. One common complaint, and I guess it is not really a complaint but a suggestion, and we do understand it, is that we did not cover coding; specifically, we did not cover coding in frameworks, for example how you code in TensorFlow, Keras, PyTorch, etc. We did not have time, and honestly speaking, that was not the purpose of this course either. Our purpose, primarily, was this: lots of people are sharing their papers online, and most of this field, as it is currently developing, exists only in papers; it is not in a textbook, because a textbook takes too long to write, and by the time it summarizes things, the field has moved elsewhere.

We are ourselves planning to write a textbook based on the kind of material that we covered, so that you have the fundamentals, okay. A textbook can only cover fundamentals, and this field, growing as rapidly as it is, lives basically in papers. So our purpose, and if you succeed in this we will be very happy, was that when you take a paper, it does not look like all Greek and Latin to you: you are able to figure out at least 50 or 60 percent of the machine learning portions of any publication, and you are able to see how and why the author applied a particular method. Why did this person apply an LSTM here? Why a convolutional LSTM here? Why a fully convolutional layer here? Why a segmentation type of architecture, and why an encoder-decoder architecture?

If you are able to build this kind of understanding, both Dr. Ganapathy and I would be very happy, because then we would have succeeded in the purpose of our course, okay. Because once you know this, you can implement practically anything: learning these frameworks does not take too much time, at least the preliminary functionality is not too hard for you to learn, and people are putting up their code. It is the basics, the basic ideas, that are a little bit hard to get, okay.

So that is one other thing that we did not cover; we are planning to put up some extra videos, and more about that later. We also did not cover applications of many of the algorithms that we talked about; we went through them really rapidly. For example, the convolutional LSTM I just talked about, and GANs, we did not discuss applications for; we did not discuss how you actually apply MLE, MAP and Bayesian methods, you have seen them only in theory. They have beautiful applications within inverse problems; I was planning on showing one such application, but unfortunately time was too short and it would have become a little too complex for you.

Naive Bayes, binary trees, random forests, SVMs, etc.: a lot of them actually have very nice applications in engineering and science, but this course was too short for us to cover those. One very important topic that we did not cover is reinforcement learning; within engineering and science it is extremely useful in what is known as controls, how to control something, you know, how to make sure that an inverted pendulum stays upright, controlling an aircraft, etc.

So reinforcement learning is slowly coming into this field; it has always had heavy applications on the gaming side, sort of video game playing. Google has bought DeepMind, which started by doing heavy reinforcement learning for several problems. Now, there is a full course by Professor Balaraman Ravindran on NPTEL; we would very highly recommend that you go through that course. The videos are up online, and he is the expert on this topic within India, so we would highly recommend that you take a look in case you are interested in this field. It is a vast field in itself; it would have taken us at least a week just to give you the beginning ideas.

But Professor Balaraman Ravindran's course is available right now on NPTEL, and we would highly recommend it. The last topic which is expected to get more and more popular in the future is probabilistic graphical models; we did not have time for this.

Now, since we did not have time for a lot of topics, we are actually planning to upload more material, which will be publicly available on YouTube or some such place if it is a video, or otherwise just on our websites if it is written material. If you are interested in this material, or in knowing whenever we put it up, probably we will start doing this this summer, May, June, July, please contact us through the form at this link here; you will just be asked to enter an email, and if you are interested in some specific video you can mention that.

We have also put up the same link on the forums for those of you who are there currently on the
NPTEL forums. So please contact us here in case you are interested, okay.

(Refer Slide Time: 10:16)

Finally, for those of you who have finished this course: several of you have asked us, both in our live sessions and on the forum, what you should do next. Okay, so the first thing, of course, is that the field is moving extremely rapidly; if I tell you to solve problem A today, it is quite possible that somebody might have already solved it by the time I finish speaking. So it is important to know what your actual interest is.

Now, our course was pitched towards people who are in engineering, something like chemical engineering, mechanical engineering, maybe even electrical engineering (actually the course was neutral to all of that, except for the last few applications that we discussed), or in physics or chemistry, and not towards traditional computer science students, okay. So we are hoping that you already know some field reasonably well, at the very least at the undergraduate level. In that case you can start thinking: oh, this was the kind of application this person did for CFD, can I try something similar in my field?

Okay, so our experience with the live classes at IIT Madras has been that whenever we present this course, several ideas usually come from the students automatically, and we hope that some such ideas have come to your mind also, okay.

Now, if you want to develop better expertise in this, it is ideally best to start with some paper that seems interesting to you; that is the most important thing. Please start with something that actually looks interesting, rather than something you would do just for the heck of it; it is much better if you are genuinely interested in it. Find some paper with interesting results and, ideally, since you are starting out, look for a paper whose authors have put up their code, either on their website or, as most people do, on GitHub, with links given within the paper.

In fact, the authors of one of the applications we discussed just now, the PDE paper, have put up their code on GitHub, and that is a good place to start; you can take a look at it and see how people have done it, okay. Try to read the paper and, without looking at the code, try to replicate its results; then, as and when you get stuck, that is when you should refer to how the researchers actually did it in the code they have put up on their website, okay.

If you iterate this with multiple papers, you will find yourself very rapidly gaining confidence in your ability to take an idea and actually execute it as code, okay; that is important. And once you have some confidence, you can try Kaggle, which is a sort of data science competition platform, and you can also program some project ideas of your own from scratch. Please remember something that we have been emphasizing multiple times: any application comes with two things. You have to have some amount of domain expertise, and you have to have some knowledge of machine learning, at least enough to know why something would work, why something would not work, and what problems you would encounter during training.

So with this we will end the course. We apologize for all the glitches and hitches that happened during the course; there were several problems, both on the forums and in the assignments. This is the first time that we are running this course as a MOOC, so there were some teething troubles; hopefully they did not completely spoil your experience of the course, and hopefully you gained something. As I said, if you find yourself understanding something that is discussed online or in a paper you are reading, we would be happy that we have actually succeeded in our aim for this course. Thank you, and good luck.

