Focal Loss
- a mismatch between a DNN's confidence and its correctness makes its predictions hard to rely on
- ideally, want networks to be accurate, calibrated and confident
the paper shows that
- as opposed to the standard cross-entropy loss, focal loss allows one to learn models that are already very well calibrated
- when combined with temperature scaling, whilst preserving accuracy, it yields state-of-the-art calibrated models
- provide a thorough analysis of the factors causing miscalibration, and justify the empirically excellent performance of focal loss
- provide a principled approach for automatically selecting the hyperparameter of the focal loss
- perform experiments on a variety of computer vision and NLP datasets, and with a wide variety of network architectures
Introduction
- many multi-class classification networks are poorly calibrated, in the sense that the probability values that they associate with the class labels they predict overestimate the likelihoods
- the underlying cause is hypothesised to be that these networks' high capacity leaves them vulnerable to overfitting on the negative log-likelihood loss
- much work has been inspired by approaches that were not originally formulated in a deep learning context, such as
- Platt scaling
- histogram binning
- isotonic regression
- Bayesian binning and averaging
- various works have begun to directly target the calibration of deep networks
- temperature scaling, a modern variant of Platt scaling that divides a network's logits by a scalar $T > 0$ learned on a held-out validation set (a minimal sketch appears after this list)
- Drawbacks: whilst it scales the logits to reduce the network's confidence in incorrect predictions, it also slightly reduces the network's confidence in predictions that were correct; moreover, temperature scaling is known not to calibrate a model under data distribution shift
- Maximum Mean Calibration Error (MMCE): minimize a differentiable proxy for calibration error at training time
- training models on cross-entropy loss with label smoothing instead of one-hot labels
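a minimal sketch of post-hoc temperature scaling as described above, assuming PyTorch and a held-out validation set of logits and labels (the function name is illustrative, not from any particular library):

```python
import torch
import torch.nn.functional as F

def fit_temperature(val_logits, val_labels, lr=0.01, steps=200):
    """Fit a single scalar T > 0 by minimising the NLL on held-out validation data."""
    log_t = torch.zeros(1, requires_grad=True)  # optimise log T so that T = exp(log_t) > 0
    opt = torch.optim.Adam([log_t], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.cross_entropy(val_logits / log_t.exp(), val_labels)
        loss.backward()
        opt.step()
    return log_t.exp().item()

# at test time, probabilities become softmax(logits / T); accuracy is unchanged
# because dividing by a positive scalar preserves the argmax of the logits.
```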
the paper proposes to replace the cross-entropy loss with the focal loss
- cross-entropy minimises the KL divergence between the predicted (softmax) distribution and the target distribution (one-hot encoding in classification tasks) over classes
- focal loss minimises a regularized KL divergence between these two distributions, which ensures minimisation of the KL divergence whilst increasing the entropy of the predicted distribution, thereby preventing the model from becoming overconfident.
- focal loss depends on a hyperparameter $\gamma$ that normally needs to be cross-validated; the paper also provides a method for choosing $\gamma$ automatically for each sample, and shows that it outperforms all the baseline models
Problem Formulation
denote
- $y_i\in \cY = \{1, 2,\ldots, K\}$: the ground-truth class label of the $i$-th sample $\bfx_i$
- $\hat p_{i, y} = f_\theta(y\mid \bfx_i)$: the probability the network predicts for class $y$ on input $\bfx_i$
- $\hat y_i = \argmax_y \hat p_{i, y}$: the predicted class
- the predicted confidence: $\hat p_i = \max_y \hat p_{i, y}$
the network is said to be perfectly calibrated when, for each sample $(\bfx, y)\in D$, the confidence $\hat p$ equals the accuracy $\bbP(\hat y = y \mid \hat p)$
expected calibration error (ECE):
\[\bbE_{\hat p}[\vert \bbP(\hat y = y\mid \hat p) - \hat p\vert]\]in practice, the interval $[0, 1]$ is divided into $M$ equal-width bins to approximate this expectation
- $B_i$: set of samples whose confidence falls in the $i$-th bin
- $A_i = \frac{1}{\vert B_i\vert}\sum_{k\in B_i} 1(\hat y_k = y_k)$: accuracy of the $i$-th bin
- $C_i = \frac{1}{\vert B_i\vert}\sum_{k\in B_i} \hat p_k$: average confidence of the $i$-th bin
- $N$: total number of samples
then
\[ECE = \sum_{i=1}^M \frac{\vert B_i\vert}{N} \vert A_i-C_i\vert\]the maximum calibration error (MCE)
\[MCE = \max_{i\in \{1,\ldots, M\}} \vert A_i - C_i\vert\]AdaECE: bin sizes are calculated to evenly distribute samples between bins
\[AdaECE = \sum_{i=1}^M \frac{\vert B_i\vert}{N} \vert A_i-C_i\vert\quad \text{s.t.}\quad \vert B_i\vert = \vert B_j\vert\ \ \forall i, j\]Classwise-ECE: the ECE metric only considers the probability of the predicted class, without considering the other scores in the softmax distribution
\[ClasswiseECE = \frac{1}{K}\sum_{i=1}^M\sum_{j=1}^K\frac{\vert B_{ij}\vert}{N}\vert A_{ij} - C_{ij}\vert\]where $B_{ij}$ is the set of samples whose predicted probability for class $j$ falls in the $i$-th bin, and
\[A_{ij} = \frac{1}{\vert B_{ij}\vert }\sum_{k\in B_{ij}}1(j = y_k)\quad C_{ij} = \frac{1}{\vert B_{ij}\vert} \sum_{k\in B_{ij}}\hat p_{kj}\]
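a minimal sketch of the binned ECE defined above, assuming numpy arrays of top-label confidences, predicted labels and ground-truth labels (equal-width bins; illustrative rather than a reference implementation):

```python
import numpy as np

def expected_calibration_error(confidences, predictions, labels, n_bins=15):
    """Equal-width-bin ECE: sum_i |B_i|/N * |A_i - C_i|."""
    bin_edges = np.linspace(0.0, 1.0, n_bins + 1)
    correct = (predictions == labels).astype(float)
    n = len(confidences)
    ece = 0.0
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)  # B_i
        if in_bin.any():
            acc = correct[in_bin].mean()       # A_i: accuracy of the bin
            conf = confidences[in_bin].mean()  # C_i: average confidence of the bin
            ece += (in_bin.sum() / n) * abs(acc - conf)
    return ece
```

for AdaECE the bin edges would instead be quantiles of the confidences, so that every bin holds the same number of samples; Classwise-ECE repeats the same computation per class, using $\hat p_{kj}$ in place of the top-label confidence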
Improving Calibration using Focal Loss
for classification tasks, where the target distribution $q$ is a one-hot encoding of the ground-truth class, the focal loss is defined as
\[\cL_f = -(1-\hat p_{i, y_i})^\gamma \log \hat p_{i, y_i}\]the cross-entropy loss $\cL_c$ forms an upper bound on the KL divergence
\[\cL_c \ge KL(q\Vert \hat p)\]while the focal loss is an upper bound on the regularized KL divergence
\[\cL_f \ge KL(q\Vert \hat p) -\gamma \bbH(\hat p)\]
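one way to see this bound, as a sketch assuming a one-hot target $q$ and $\gamma \ge 1$ (not necessarily the paper's exact proof): Bernoulli's inequality gives $(1-\hat p_{i, y_i})^\gamma \ge 1 - \gamma \hat p_{i, y_i}$, and since $-\log \hat p_{i, y_i} \ge 0$,
\[\cL_f \ge -(1-\gamma \hat p_{i, y_i})\log \hat p_{i, y_i} = \cL_c + \gamma\, \hat p_{i, y_i}\log \hat p_{i, y_i} \ge \cL_c - \gamma\, \bbH(\hat p) = KL(q\Vert \hat p) - \gamma\, \bbH(\hat p)\]where the second inequality uses $\hat p_{i, y_i}\log \hat p_{i, y_i} \ge \sum_j \hat p_{i, j}\log \hat p_{i, j} = -\bbH(\hat p)$, and the final equality holds because a one-hot $q$ has zero entropy, so $\cL_c = KL(q\Vert \hat p)$

a minimal PyTorch sketch of the loss, together with a sample-dependent $\gamma$ schedule in the spirit of the paper's FLSD-53 policy (treat the threshold and $\gamma$ values here as illustrative):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=3.0):
    """Multi-class focal loss: mean over the batch of -(1 - p_y)^gamma * log p_y."""
    log_p = F.log_softmax(logits, dim=-1)
    log_p_y = log_p.gather(1, targets.unsqueeze(1)).squeeze(1)  # log p_{i, y_i}
    p_y = log_p_y.exp()
    return (-(1.0 - p_y) ** gamma * log_p_y).mean()

def sample_dependent_gamma(p_y, threshold=0.2, gamma_low=5.0, gamma_high=3.0):
    """Larger gamma when the predicted probability of the true class is small;
    the resulting per-sample tensor can be passed as `gamma` above."""
    gamma = torch.full_like(p_y, gamma_high)
    gamma[p_y < threshold] = gamma_low
    return gamma
```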