Focal Loss: learning it the easy way

Rishabh Agrahari
4 min read · Nov 24, 2020
[Figure: Focal Loss vs Cross-Entropy Loss]

Focal Loss was introduced by Tsung-Yi Lin et al. in their paper Focal Loss for Dense Object Detection. Several blogs have already explained the intuition behind Focal Loss, and they have done it well. In this post, we will build that intuition in a different, more straightforward way.

Cross-Entropy Loss (CE)

To keep things simple, let’s take the case of binary classification, where the network produces a single sigmoid output. With p ∈ [0, 1] being the predicted score and y ∈ {±1} being the ground-truth label, we can calculate the Cross-Entropy Loss (CE) as:
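CE(p, y) = −log(p) if y = 1
CE(p, y) = −log(1 − p) otherwise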

We can also write the above equation more compactly. Define pt as the probability the model assigns to the ground-truth class:
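pt = p if y = 1
pt = 1 − p otherwise

so that

CE(p, y) = CE(pt) = −log(pt)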

This is how almost everyone looks at the cross-entropy loss; let’s try to look at it differently.

Essentially, because the network ends in a sigmoid activation and outputs a value between 0 and 1, the model is trained to predict the probability that the input is a sample from the positive class. Now take a look at pt: can we say that pt denotes the predicted probability of the input sample belonging to its corresponding ground-truth class? Yes!

Didn’t get the last bit? Let’s see it with some examples. Suppose for a positive sample (y = +1) the model outputs 0.9; we say that the probability of the input sample belonging to the positive class is 0.9. Now suppose for a negative sample (y = −1) the model outputs 0.4; we say that the probability of the input sample belonging to the positive class is 0.4, so the probability of it belonging to the negative class is 0.6 (i.e., 1 − output). This is exactly what pt denotes: the predicted probability of the input sample belonging to the class it actually belongs to.

So, since the CE loss is a negative logarithm, the higher the pt, the lower the loss. Which ultimately makes sense, right? The more confident the model is that the input sample belongs to its ground-truth class, the lower the loss!
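To make this concrete, here is a minimal Python sketch (not from the original post; the function names are just for illustration) that computes pt and the cross-entropy loss for the two examples above:

```python
import math

def p_t(p, y):
    """Probability the model assigns to the ground-truth class, with y in {+1, -1}."""
    return p if y == 1 else 1.0 - p

def ce_loss(p, y):
    """Binary cross-entropy written in terms of p_t: CE = -log(p_t)."""
    return -math.log(p_t(p, y))

# Positive sample (y = +1) scored 0.9 -> p_t = 0.9, low loss
print(p_t(0.9, +1), ce_loss(0.9, +1))   # 0.9, ~0.105

# Negative sample (y = -1) scored 0.4 -> p_t = 0.6, higher loss
print(p_t(0.4, -1), ce_loss(0.4, -1))   # 0.6, ~0.511
```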

Now that we truly understand the cross-entropy loss, it’s time to jump to Focal Loss.

Focal Loss (FL)

Focal Loss adds a modulating factor to the Cross Entropy loss. Here’s the equation:
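FL(pt) = −(1 − pt)^γ · log(pt)

where (1 − pt)^γ is the modulating factor.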

Gamma (γ) is a tunable focusing parameter; the paper experiments with γ ∈ [0, 5]. Let’s take γ = 2.0 to understand the equation.

Now, Focal Loss deals with the huge class imbalance faced by one-stage object detectors like SSD and RetinaNet by focusing on the hard examples and preventing the model from getting overwhelmed by the many easy negative examples. How does it do that? Since we understand what pt signifies, we should be able to guess the effect of the modulating factor on the loss. Let’s ask a few questions.

Ques: What’s an easy example?

Ans: Where pt is high (say, 0.9).

Ques: What happens to the modulating factor in the case of an easy example?

Ans: It scales the loss down. (1 − 0.9)² = 0.01, so the loss is 1/100th of what it would have been under plain CE. This shows that the losses accumulated by the easy examples (which are far more abundant than hard examples) will not dominate the total loss.

Ques: What’s a hard example?

Ans: Where pt is low (say, 0.1).

Ques: What happens to the modulating factor in the case of a hard example?

Ans: It barely changes the loss. (1 − 0.1)² = 0.81, so the loss is still 81/100th of what it would have been under plain CE. Also, the modulating factor here is 81× the one computed for the easy example (and the −log(pt) term is larger too)! This shows that, relative to easy examples, the model is still penalized heavily for its wrong predictions on hard examples.

Did you notice we didn’t even talk about the ground-truth class labels while understanding the effect of the modulating factor? That’s the key. It’s all about easy and hard examples. There you go!
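To verify these numbers, here is a small Python sketch (again just an illustration, not the paper’s implementation) comparing CE and FL at γ = 2.0 for the easy and hard examples above:

```python
import math

GAMMA = 2.0

def ce_loss(pt):
    """Cross-entropy as a function of pt."""
    return -math.log(pt)

def focal_loss(pt, gamma=GAMMA):
    """Focal loss: the CE loss scaled by the modulating factor (1 - pt)^gamma."""
    return ((1.0 - pt) ** gamma) * ce_loss(pt)

for pt in (0.9, 0.1):  # easy example vs hard example
    factor = (1.0 - pt) ** GAMMA
    print(f"pt={pt}: modulating factor={factor:.2f}, "
          f"CE={ce_loss(pt):.4f}, FL={focal_loss(pt):.4f}")

# pt=0.9: modulating factor=0.01, CE=0.1054, FL=0.0011
# pt=0.1: modulating factor=0.81, CE=2.3026, FL=1.8651
```

Notice how the easy example’s loss drops by a factor of 100 while the hard example’s loss is barely reduced, so the hard examples end up dominating the training signal.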

The paper also proposes an α-balanced form of the focal loss, which assigns a preset weight to each class to further tackle the challenge of class imbalance.

α-balanced Focal Loss
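As in the paper, αt is defined analogously to pt (αt = α for y = 1 and αt = 1 − α otherwise), and the loss becomes:

FL(pt) = −αt · (1 − pt)^γ · log(pt)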

Still got doubts? Shoot them in the comments!
