Label Smoothing
This note is for Müller, R., Kornblith, S., & Hinton, G. E. (2019). When does label smoothing help? Advances in Neural Information Processing Systems, 32.
- generalization and learning speed of a multi-class neural network can often be significantly improved by using soft targets that are a weighted average of the hard targets and the uniform distribution over labels
- smoothing the labels prevents the network from becoming over-confident
the paper shows empirically that, in addition to improving generalization,
- label smoothing improves model calibration, which can significantly improve beam search
- but if a teacher network is trained with label smoothing, knowledge distillation into a student network is much less effective
they visualize how label smoothing changes the representations learned by the penultimate layer of the network
- label smoothing encourages the representations of training examples from the same class to group in tight clusters
- this comes at the cost of losing information in the logits about resemblances between instances of different classes, which is necessary for distillation but does not hurt generalization or calibration of the model’s predictions
contributions
- a novel visualization method based on linear projections of the penultimate layer activations. It provides intuition regarding how representations differ between penultimate layers of networks trained with and without label smoothing (see the sketch after this list)
- label smoothing implicitly calibrates learned models so that the confidences of their predictions are more aligned with the accuracies of their predictions
- label smoothing impairs distillation: when teacher models are trained with label smoothing, distillation into student networks is less effective
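A minimal sketch of what such a projection might look like, assuming access to the penultimate-layer activations and the last-layer weight matrix (the class "templates"); the function and argument names are mine, not from the paper's code:

```python
import numpy as np

def project_onto_template_plane(activations, W, classes):
    """Project penultimate-layer activations onto the plane passing
    through the templates (last-layer weight vectors) of three classes.

    activations: (N, D) penultimate-layer activations of the examples
    W:           (K, D) last-layer weight matrix, one template per class
    classes:     three class indices, e.g. (0, 1, 2)
    Returns (N, 2) coordinates suitable for a 2-D scatter plot.
    """
    t = W[list(classes)]                                       # (3, D) templates
    # two directions spanning the plane through the three templates
    directions = np.stack([t[1] - t[0], t[2] - t[0]], axis=1)  # (D, 2)
    basis, _ = np.linalg.qr(directions)                        # orthonormal basis of the plane
    return (activations - t[0]) @ basis                        # coordinates in the plane
```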
write the prediction of a neural network as a function of the activations in the penultimate layer as
\[p_k = \frac{\exp(x^T w_k)}{\sum_{l=1}^K \exp(x^T w_l)}\]
where
- $p_k$ is the likelihood the model assigns to the $k$-th class
- $w_k$ represents the weights and biases of the last layer
- $x$ is the vector containing the activations of the penultimate layer of a neural network concatenated with 1 to account for the bias
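As a concrete illustration, a minimal NumPy sketch of this prediction (the names and array shapes are my assumptions, not the paper's code):

```python
import numpy as np

def predict_probs(x, W):
    """Softmax prediction p_k = exp(x^T w_k) / sum_l exp(x^T w_l).

    x: (D+1,) penultimate-layer activations with a trailing 1 for the bias
    W: (K, D+1) last-layer weights and biases, one row w_k per class
    """
    logits = W @ x
    logits -= logits.max()          # subtract the max for numerical stability
    exp_logits = np.exp(logits)
    return exp_logits / exp_logits.sum()
```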
for a network trained with hard targets, minimize the expected value of the cross-entropy between the true targets $y_k$ and the network’s outputs $p_k$ as in
\[H(y, p) = \sum_{k=1}^K -y_k\log p_k\]
where $y_k$ is 1 for the correct class and 0 for the rest
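A small sketch of this loss (the `eps` term is my addition for numerical safety, not part of the paper's formulation):

```python
import numpy as np

def cross_entropy(y, p, eps=1e-12):
    """H(y, p) = -sum_k y_k * log(p_k); for a one-hot y this is just
    the negative log-probability of the correct class."""
    return float(-np.sum(y * np.log(p + eps)))
```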
for a network trained with label smoothing of parameter $\alpha$, minimize the cross-entropy between the modified targets $y_k^{LS}$ and the network’s outputs $p_k$, where
\[y_k^{LS} = y_k(1-\alpha) + \alpha/K\]
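A minimal sketch of how the smoothed targets might be constructed (the names are mine; the paper only gives the formula):

```python
import numpy as np

def smooth_targets(y, alpha):
    """y_k^LS = y_k * (1 - alpha) + alpha / K for a one-hot target y of length K."""
    K = y.shape[0]
    return y * (1 - alpha) + alpha / K

# with K = 10 classes and alpha = 0.1, the correct class target becomes
# 0.91 and every other class gets 0.01
y = np.zeros(10)
y[3] = 1.0
y_ls = smooth_targets(y, alpha=0.1)
# training with label smoothing minimizes H(y_ls, p) instead of H(y, p)
```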