
Loss Functions Explained


Harsha Bommana
Sep 30 · 8 min read

In any deep learning project, configuring the loss function is one of the most important
steps to ensure the model will work in the intended manner. The loss function gives a
lot of practical flexibility to your neural networks, and it defines exactly how the
output of the network is connected to the rest of the training process.

There are several tasks neural networks can perform, from predicting continuous
values like monthly expenditure to classifying discrete classes like cats and dogs. Each
different task would require a different type of loss since the output format will be
different. For very specialized tasks, it’s up to us how we want to define the loss.

From a very simplified perspective, the loss function (J) can be defined as a function
which takes in two parameters:

1. Predicted Output

2. True Output
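Put very roughly in Python (just an illustrative sketch, not any particular library's API):

def loss(y_pred, y_true):
    # Takes the predicted output and the true output and returns a single
    # number measuring how far apart they are (an absolute difference here,
    # purely as a placeholder).
    return abs(y_true - y_pred)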


Neural Network Loss Visualization

This function essentially calculates how poorly our model is performing by
comparing what the model is predicting with the actual value it is supposed to output. If
Y_pred is very far off from Y, the loss value will be very high. However, if both values are
almost the same, the loss value will be very low. Hence we need to choose a loss function
that can penalize the model effectively while it is training on the dataset.

If the loss is very high, this large value will propagate back through the network during
training and the weights will be changed a little more than usual. If it is small, then the
weights won't change that much, since the network is already doing a good job.

This scenario is somewhat analogous to studying for exams. If someone does poorly in an
exam, we can say the loss is very high, and they will have to change a lot about how they
prepare in order to get a better grade next time. However, if the exam went well, they
wouldn't do anything very different from what they are already doing for the next exam.

Now let’s look at classification as a task and understand how the loss functions work in
this case.

. . .

Classification Losses
When a neural network is trying to predict a discrete value, we can consider it to be a
classification model. This could be a network trying to predict what kind of animal is
present in an image, or whether an email is spam or not. First let’s look at how the output
is represented for a classification neural network.


Classification Neural Network Output Format

The number of nodes of the output layer will depend on the number of classes present in
the data. Each node will represent a single class. The value of each output node
essentially represents the probability of that class being the correct class.

Pr(Class 1) = Probability of Class 1 being the correct class

Once we get the probabilities of all the different classes, we will consider the class having
the highest probability to be the predicted class for that instance. First let’s explore how
binary classification is done.

Binary Classification
In binary classification, there will be only one node in the output layer even though we
will be predicting between two classes. In order to get the output in a probability format,
we need to apply an activation function. Since probability requires a value in between 0
and 1 we will use the sigmoid function which can squish any real value to a value
between 0 and 1.


Sigmoid Function Graph Visualization

As the input to the sigmoid becomes larger and tends to plus infinity, the output of the
sigmoid will tend to 1. And as the input becomes smaller and tends to negative infinity,
the output will tend to 0. Now we are guaranteed to always get a value between 0 and 1,
which is exactly how we need it to be since we require probabilities.
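A quick sketch of the sigmoid itself, using NumPy for the math:

import numpy as np

def sigmoid(x):
    # Squashes any real value into the (0, 1) range.
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(10))   # very close to 1
print(sigmoid(-10))  # very close to 0
print(sigmoid(0))    # exactly 0.5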

If the output is above 0.5 (50% probability), we will consider it to fall under the
positive class, and if it is below 0.5 we will consider it to fall under the negative
class. For example, if we are training a network to classify between cats and dogs, we can
assign dogs the positive class, so the output value in the dataset for dogs will be 1;
similarly, cats will be assigned the negative class, so the output value for cats will be 0.

The loss function we use for binary classification is called binary cross entropy (BCE).
This function effectively penalizes the neural network for a binary classification task.
Let's look at how this function behaves.


Binary Cross Entropy Loss Graphs

As you can see, there are two separate functions, one for each value of Y. When we need
to predict the positive class (Y = 1), we will use

Loss = -log(Y_pred)

And when we need to predict the negative class (Y = 0), we will use

Loss = -log(1-Y_pred)

As you can see in the graphs, for the first function, when Y_pred is equal to 1, the loss is
equal to 0, which makes sense because Y_pred is exactly the same as Y. As the Y_pred
value gets closer to 0, we can observe the loss value increasing at a very high rate,
and when Y_pred becomes 0 it tends to infinity. This is because, from a classification
perspective, 0 and 1 have to be polar opposites, since they each represent
completely different classes. So when Y_pred is 0 while Y is 1, the loss will have to be
very high in order for the network to learn its mistakes more effectively.


Binary Classification Loss Comparisons

We can mathematically represent the entire loss function in one equation as follows:

Binary Cross Entropy Full Equation
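Written out in the same notation as before, the combined equation is:

Loss = -(Y * log(Y_pred) + (1 - Y) * log(1 - Y_pred))

When Y = 1 the second term vanishes and we are left with -log(Y_pred), and when Y = 0 the first term vanishes and we are left with -log(1 - Y_pred). As a rough NumPy sketch of this loss averaged over a batch (the small clipping constant is only there to keep log() finite, it is not part of the formula):

import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    # Clip predictions away from exactly 0 or 1 so log() stays finite.
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1.0, 0.0, 1.0])
y_pred = np.array([0.9, 0.1, 0.8])
print(binary_cross_entropy(y_true, y_pred))  # small loss, predictions are close to the targets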

This loss function is also called Log Loss. This is how the loss function is designed for
a binary classification neural network. Now let's move on to see how the loss is defined
for a multiclass classification network.

Multiclass Classification
Multiclass classification is appropriate when we need our model to predict one possible
class output every time. Now since we are still dealing with probabilities it might make
sense to just apply sigmoid to all the output nodes so that we get values between 0–1 for
all the outputs, but there is an issue with this. When we are considering probabilities for
multiple classes, we need to ensure that the sum of all the individual probabilities is
equal to one, since that is how probability is defined. Applying sigmoid does not ensure
that the sum is always equal to one, hence we need to use another activation function.

The activation function we use in this case is softmax. This function ensures that all the
output nodes have values between 0–1 and the sum of all output node values equals to
1 always. The formula for softmax is as follows:


Softmax Formula
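In plain terms, for output values x_1, x_2, ..., x_n, the softmax of the i-th value is:

softmax(x_i) = exp(x_i) / (exp(x_1) + exp(x_2) + ... + exp(x_n))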

Let’s visualize this with an example:

Softmax Example Visualization

So as you can see, we are simply passing all the values through an exponential function. After
that, to make sure they are all in the range of 0–1 and that the sum of all the
output values equals 1, we are just dividing each exponential by the sum of all the
exponentials.

So why do we have to pass each value through an exponential before normalizing them?
Why can't we just normalize the values themselves? This is because the goal of softmax
is to make one value very high (close to 1) and all the other values very low
(close to 0). The exponential amplifies the differences between the values, which is what
makes this happen, and then we normalize because we need probabilities.
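A small NumPy sketch of softmax (subtracting the maximum before exponentiating is a common numerical-stability trick, not part of the definition):

import numpy as np

def softmax(x):
    # Exponentiate, then normalize so the outputs sum to 1.
    exps = np.exp(x - np.max(x))
    return exps / np.sum(exps)

print(softmax(np.array([2.0, 1.0, 0.1])))  # roughly [0.66, 0.24, 0.10], sums to 1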

Now that our outputs are in a proper format, let’s go ahead to look at how we configure
the loss function for this. The good thing is that the loss function is essentially the same
as that of binary classification. We will just apply log loss on each output node with
respect to its respective target value and then we will find the sum of this across all
output nodes.
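As a NumPy sketch of the standard categorical cross entropy for a single example, assuming the target is one-hot encoded and the predictions have already gone through softmax (with a one-hot target this reduces to -log of the probability assigned to the true class):

import numpy as np

def categorical_cross_entropy(y_true, y_pred, eps=1e-7):
    # y_true is one-hot, y_pred is the softmax output for one example.
    y_pred = np.clip(y_pred, eps, 1.0)
    return -np.sum(y_true * np.log(y_pred))

y_true = np.array([0.0, 1.0, 0.0])   # the correct class is class 2
y_pred = np.array([0.1, 0.8, 0.1])   # softmax output
print(categorical_cross_entropy(y_true, y_pred))  # -log(0.8) ≈ 0.22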


Categorical Cross Entropy Visualization

This loss is called Categorical Cross Entropy. Now let's move on to a special case of
classification called multilabel classification.

Multilabel Classification
Multilabel classification is done when your model needs to predict multiple classes as
the output. For example, let’s say you are training a neural network to predict the
ingredients present in a picture of some food. There will be multiple ingredients we need
to predict so there will be multiple 1’s in Y.

For this we can’t use softmax because softmax will always force only one class to become
1 and other classes to become 0. So instead we can simply keep sigmoid on all the output
node values since we are trying to predict each class’s individual probability.

As for the loss we can directly use log loss on each node and sum it, similar to what we
did in multiclass classification.
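As a NumPy sketch, assuming sigmoid outputs per node and a multi-hot target vector:

import numpy as np

def multilabel_log_loss(y_true, y_pred, eps=1e-7):
    # Per-node binary log loss, summed over all output nodes.
    y_pred = np.clip(y_pred, eps, 1 - eps)
    per_node = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return np.sum(per_node)

y_true = np.array([1.0, 0.0, 1.0, 0.0])   # e.g. two ingredients present
y_pred = np.array([0.9, 0.2, 0.7, 0.1])   # sigmoid output for each ingredient
print(multilabel_log_loss(y_true, y_pred))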

Now that we have covered classification, let’s now move on to regression.

Regression Loss
In regression, our model is trying to predict a continuous value. Some examples of
regression models are:


House price prediction

Person age prediction

In regression models, our neural network will have one output node for every
continuous value we are trying to predict. Regression losses are calculated by
performing direct comparisons between the output value and the true value.

The most popular loss function we use for regression models is the mean squared error
loss function. In this we simply calculate the square of the difference between Y and
Y_pred and average this over all the data. Suppose there are n data points:

Mean Squared Error Loss Function
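In the same notation, with n data points:

Loss = (1/n) * Σ (Y_i - Y_pred_i)²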

Here Y_i and Y_pred_i refer to the i-th Y value in the dataset and the corresponding
Y_pred from the neural network for the same data point.
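A rough NumPy sketch of this:

import numpy as np

def mean_squared_error(y_true, y_pred):
    # Average of the squared differences across all n data points.
    return np.mean((y_true - y_pred) ** 2)

y_true = np.array([250.0, 300.0, 180.0])   # e.g. true house prices (hypothetical numbers)
y_pred = np.array([245.0, 310.0, 200.0])   # the network's predictions
print(mean_squared_error(y_true, y_pred))  # (25 + 100 + 400) / 3 = 175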

That concludes this article. Hopefully now you have a deeper understanding of how loss
functions are configured for various tasks in deep learning. Thank you for reading!

