
Label Smoothing

Tags: Label Smoothing, Classification, Neural Network

This note is for Müller, R., Kornblith, S., & Hinton, G. E. (2019). When does label smoothing help? Advances in Neural Information Processing Systems, 32.

  • generalization and learning speed of a multi-class neural network can often be significantly improved by using soft targets that are a weighted average of the hard targets and the uniform distribution over labels
  • smoothing the labels prevents the network from becoming over-confident

the paper shows empirically that, in addition to improving generalization,

  • label smoothing improves model calibration, which can significantly improve beam search
  • but if a teacher network is trained with label smoothing, knowledge distillation into a student network is much less effective

they visualize how label smoothing changes the representations learned by the penultimate layer of the network

  • label smoothing encourages the representations of training examples from the same class to group in tight clusters
  • this clustering comes at the cost of information in the logits about resemblances between instances of different classes; that information is necessary for distillation, but losing it does not hurt generalization or calibration of the model’s predictions

contributions

  • a novel visualization method based on linear projections of the penultimate layer activations. It provides intuition regarding how representations differ between penultimate layers of networks trained with and without label smoothing (a rough sketch of the projection follows this list)
  • label smoothing implicitly calibrates learned models so that the confidences of their predictions are more aligned with the accuracies of their predictions
  • label smoothing impairs distillation, i.e., distillation works much worse when the teacher model is trained with label smoothing
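
As a rough idea of the projection-based visualization, here is a minimal sketch assuming the scheme described in the paper: pick three classes, take the plane through their last-layer weight templates, and project the penultimate-layer activations of examples from those classes onto that plane. The function name and arguments below are illustrative, not the authors' code.

```python
import numpy as np

def project_to_class_plane(activations, W, classes=(0, 1, 2)):
    """Project penultimate-layer activations onto the plane through the
    weight templates of three chosen classes (sketch; names are illustrative).

    activations : array of shape (n_examples, d)
    W           : last-layer weights, one row per class, shape (K, d)
    """
    t0, t1, t2 = W[list(classes)]
    # two directions spanning the plane, orthonormalized by Gram-Schmidt
    b1 = (t1 - t0) / np.linalg.norm(t1 - t0)
    v2 = (t2 - t0) - ((t2 - t0) @ b1) * b1
    b2 = v2 / np.linalg.norm(v2)
    basis = np.stack([b1, b2], axis=1)   # shape (d, 2)
    return activations @ basis           # 2-D coordinates for plotting
```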

write the prediction of a neural network as a function of the activations in the penultimate layer as

\[p_k = \frac{\exp(x^Tw_k)}{\sum_{l=1}^K\exp(x^Tw_l)}\]

where

  • $p_k$ is the likelihood the model assigns to the $k$-th class
  • $w_k$ represents the weights and biases of the last layer
  • $x$ is the vector containing the activations of the penultimate layer of a neural network concatenated with 1 to account for the bias
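
As a quick sanity check, the softmax above can be written in a few lines of NumPy; the names `x` and `W` below are illustrative, not from the paper.

```python
import numpy as np

def predict_probs(x, W):
    """p_k = exp(x^T w_k) / sum_l exp(x^T w_l), computed stably.

    x : penultimate-layer activations with a trailing 1 appended for the bias, shape (d+1,)
    W : last-layer weights and biases, one row w_k per class, shape (K, d+1)
    """
    logits = W @ x                    # x^T w_k for every class k
    logits = logits - logits.max()    # subtract the max for numerical stability
    exp_logits = np.exp(logits)
    return exp_logits / exp_logits.sum()
```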

for a network trained with hard targets, minimize the expected value of the cross-entropy between the true targets $y_k$ and the network’s outputs $p_k$ as in

\[H(y, p) = \sum_{k=1}^K -y_k\log p_k\]

where $y_k$ is 1 for the correct class and 0 for the rest
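
since $y_k$ is one-hot, the sum collapses to a single term,

\[H(y, p) = -\log p_c,\]

where $c$ is the index of the correct class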

for a network trained with label smoothing parameter $\alpha$, minimize the cross-entropy between the modified targets $y_k^{LS}$ and the network’s outputs $p_k$, where

\[y_k^{LS} = y_k(1-\alpha) + \alpha/K\]
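
a minimal numerical sketch of both losses; the function names and the toy example below are illustrative, not from the paper

```python
import numpy as np

def smoothed_targets(y_onehot, alpha):
    """y_k^LS = y_k * (1 - alpha) + alpha / K."""
    K = y_onehot.shape[-1]
    return y_onehot * (1.0 - alpha) + alpha / K

def cross_entropy(targets, probs):
    """H(y, p) = -sum_k y_k log p_k."""
    return -np.sum(targets * np.log(probs))

# toy example: K = 4 classes, correct class is 2, alpha = 0.1
y = np.eye(4)[2]
p = np.array([0.05, 0.05, 0.85, 0.05])
print(cross_entropy(y, p))                          # loss with hard targets
print(cross_entropy(smoothed_targets(y, 0.1), p))   # loss with smoothed targets
```

note that with smoothed targets the loss is no longer minimized by pushing $p_c \to 1$, which is how label smoothing prevents over-confident predictions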
