Label Smoothing
This note is for Müller, R., Kornblith, S., & Hinton, G. E. (2019). When does label smoothing help? Advances in Neural Information Processing Systems, 32.
- generalization and learning speed of a multi-class neural network can often be significantly improved by using soft targets that are a weighted average of the hard targets and the uniform distribution over labels
- smoothing the labels prevents the network from becoming over-confident
the paper shows empirically that, in addition to improving generalization,
- label smoothing improves model calibration, which can significantly improve beam search
- but if a teacher network is trained with label smoothing, knowledge distillation into a student network is much less effective
they visualize how label smoothing changes the representations learned by the penultimate layer of the network
- label smoothing encourages the representations of training examples from the same class to group in tight clusters
- this comes at the cost of losing information in the logits about resemblances between instances of different classes, which is necessary for distillation but does not hurt generalization or calibration of the model’s predictions
contributions
- a novel visualization method based on linear projections of the penultimate layer activations. It provides intuition regarding how representations differ between penultimate layers of networks trained with and without label smoothing (see the sketch after this list)
- label smoothing implicitly calibrates learned models so that the confidences of their predictions are more aligned with the accuracies of their predictions
- label smoothing impairs distillation: when teacher models are trained with label smoothing, distillation into student networks is less effective
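A minimal sketch of what such a projection might look like, assuming access to the penultimate-layer activations and the last-layer weight matrix (the class "templates"); the function and argument names are mine, not from the paper's code:

```python
import numpy as np

def project_onto_template_plane(activations, W, classes):
    """Project penultimate-layer activations onto the plane passing
    through the templates (last-layer weight vectors) of three classes.

    activations: (N, D) penultimate-layer activations of the examples
    W:           (K, D) last-layer weight matrix, one template per class
    classes:     three class indices, e.g. (0, 1, 2)
    Returns (N, 2) coordinates suitable for a 2-D scatter plot.
    """
    t = W[list(classes)]                                       # (3, D) templates
    # two directions spanning the plane through the three templates
    directions = np.stack([t[1] - t[0], t[2] - t[0]], axis=1)  # (D, 2)
    basis, _ = np.linalg.qr(directions)                        # orthonormal basis of the plane
    return (activations - t[0]) @ basis                        # coordinates in the plane
```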
write the prediction of a neural network as a function of the activations in the penultimate layer as
\[p_k = \frac{\exp(x^T w_k)}{\sum_{l=1}^K \exp(x^T w_l)}\]
where
- $p_k$ is the likelihood the model assigns to the $k$-th class
- $w_k$ represents the weights and biases of the last layer
- $x$ is the vector containing the activations of the penultimate layer of a neural network concatenated with 1 to account for the bias
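As a concrete illustration, a minimal NumPy sketch of this prediction (the names and array shapes are my assumptions, not the paper's code):

```python
import numpy as np

def predict_probs(x, W):
    """Softmax prediction p_k = exp(x^T w_k) / sum_l exp(x^T w_l).

    x: (D+1,) penultimate-layer activations with a trailing 1 for the bias
    W: (K, D+1) last-layer weights and biases, one row w_k per class
    """
    logits = W @ x
    logits -= logits.max()          # subtract the max for numerical stability
    exp_logits = np.exp(logits)
    return exp_logits / exp_logits.sum()
```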
for a network trained with hard targets, minimize the expected value of the cross-entropy between the true targets $y_k$ and the network’s outputs $p_k$ as in
\[H(y, p) = \sum_{k=1}^K -y_k\log p_k\]
where $y_k$ is 1 for the correct class and 0 for the rest
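A small sketch of this loss (the `eps` term is my addition for numerical safety, not part of the paper's formulation):

```python
import numpy as np

def cross_entropy(y, p, eps=1e-12):
    """H(y, p) = -sum_k y_k * log(p_k); for a one-hot y this is just
    the negative log-probability of the correct class."""
    return float(-np.sum(y * np.log(p + eps)))
```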
for a network trained with label smoothing of parameter $\alpha$, minimize the cross-entropy between the modified targets $y_k^{LS}$ and the network’s outputs $p_k$, where
\[y_k^{LS} = y_k(1-\alpha) + \alpha/K\]
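A minimal sketch of how the smoothed targets might be constructed (the names are mine; the paper only gives the formula):

```python
import numpy as np

def smooth_targets(y, alpha):
    """y_k^LS = y_k * (1 - alpha) + alpha / K for a one-hot target y of length K."""
    K = y.shape[0]
    return y * (1 - alpha) + alpha / K

# with K = 10 classes and alpha = 0.1, the correct class target becomes
# 0.91 and every other class gets 0.01
y = np.zeros(10)
y[3] = 1.0
y_ls = smooth_targets(y, alpha=0.1)
# training with label smoothing minimizes H(y_ls, p) instead of H(y, p)
```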