
Entropy Regularization

Tags: Entropy Regularization, Classification, Neural Network

This note is for Pereyra, G., Tucker, G., Chorowski, J., Kaiser, Ł., & Hinton, G. (2017). Regularizing Neural Networks by Penalizing Confident Output Distributions (No. arXiv:1701.06548). arXiv. https://doi.org/10.48550/arXiv.1701.06548

  • systematically explore regularizing neural networks by penalizing low entropy output distributions
  • show that penalizing low entropy output distributions, which has been shown to improve exploration in RL, acts as a strong regularizer in supervised learning
  • connect a maximum entropy based confidence penalty to label smoothing through the direction of the KL divergence

Introduction

  • despite using large datasets, neural networks are still prone to overfitting
  • numerous techniques have been proposed to prevent overfitting, e.g., early stopping, L1/L2 regularization (weight decay), dropout, and batch normalization.
    • these techniques act on the hidden activations or weights of a neural network
  • regularizing the output distribution of large, deep neural networks remains largely unexplored.
  • the probabilities assigned to incorrect classes are an indication of how the network generalizes
    • distillation exploits this fact by explicitly training a small network to assign the same probabilities to incorrect classes as a large network or ensemble of networks that generalizes well
  • the paper systematically evaluates two output regularizers: a maximum-entropy-based confidence penalty and label smoothing

Directly Penalizing Confidence

a neural network produces a conditional distribution $p_\theta(y\mid x)$ over classes given an input $x$ through a softmax function

the entropy is given by

\[H(p_\theta(y\mid x)) = -\sum_i p_\theta(y_i\mid x)\log p_\theta(y_i\mid x)\]

to penalize confident output distributions, add the negative entropy to the negative log-likelihood during training

\[L(\theta) = -\sum\log p_\theta(y\mid x) - \beta H(p_\theta(y\mid x))\]

where $\beta$ controls the strength of the confidence penalty.
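
As a concrete illustration, here is a minimal PyTorch sketch of this objective; the function name, the default $\beta$, and the toy batch are my own choices, not from the paper:

```python
import torch
import torch.nn.functional as F

def confidence_penalty_loss(logits, targets, beta=0.1):
    """Negative log-likelihood minus beta times the entropy of the predicted
    distribution; minimizing it discourages over-confident (low-entropy) outputs."""
    log_probs = F.log_softmax(logits, dim=-1)           # log p_theta(y_i | x)
    probs = log_probs.exp()                             # p_theta(y_i | x)
    nll = F.nll_loss(log_probs, targets)                # mean of -log p_theta(y | x)
    entropy = -(probs * log_probs).sum(dim=-1).mean()   # mean of H(p_theta(y | x))
    return nll - beta * entropy                         # L(theta)

# toy usage: 32 examples, 10 classes
logits = torch.randn(32, 10)
targets = torch.randint(0, 10, (32,))
loss = confidence_penalty_loss(logits, targets)
```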

Annealing and Thresholding the Confidence Penalty

In RL, penalizing low entropy distributions prevents a policy network from converging early and encourages exploration

However, in supervised learning we typically want quick convergence while still preventing overfitting near the end of training, which suggests a confidence penalty that is weak at the beginning of training and strong near convergence. A simple way to achieve this is to anneal the confidence penalty.
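
For instance, one might ramp $\beta$ linearly from 0 to its final value over training; the schedule below is just an illustrative sketch, not something specified in the paper:

```python
def annealed_beta(step, total_steps, beta_max=0.1):
    """Linearly increase the confidence-penalty weight from 0 to beta_max,
    so the penalty is weak early in training and strong near convergence."""
    return beta_max * min(1.0, step / float(total_steps))

# e.g. plug into the loss above at each training step:
# loss = confidence_penalty_loss(logits, targets, beta=annealed_beta(step, total_steps))
```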

another way to strengthen the confidence penalty as training progresses is to only penalize output distributions when they are below a certain entropy threshold. We can achieve this by adding a hinge loss to the confidence penalty, leading to an objective of the form

\[L(\theta) = -\sum \log p_\theta(y\mid x) - \beta \max(0, \Gamma - H(p_\theta(y\mid x)))\]

where $\Gamma$ is the entropy threshold below which we begin applying the confidence penalty
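
A minimal sketch of the thresholded penalty, reading the hinge term $\max(0, \Gamma - H)$ as a penalty that is active only when the entropy falls below $\Gamma$ (so it is added to the minimized loss in this sketch); the function name and the default values of $\beta$ and $\Gamma$ are my own choices:

```python
import torch
import torch.nn.functional as F

def thresholded_confidence_penalty_loss(logits, targets, beta=0.1, gamma=0.5):
    """NLL plus a hinge penalty that only acts on examples whose predictive
    entropy H(p_theta(y|x)) is below the threshold gamma."""
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    nll = F.nll_loss(log_probs, targets)
    entropy = -(probs * log_probs).sum(dim=-1)        # per-example H(p_theta(y|x))
    hinge = torch.clamp(gamma - entropy, min=0.0)     # max(0, Gamma - H), zero above the threshold
    return nll + beta * hinge.mean()                  # minimizing pushes low entropies back toward Gamma
```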

Connection to Label Smoothing

When the prior label distribution is uniform, label smoothing is equivalent to adding the KL divergence between the uniform distribution $u$ and the network’s predicted distribution $p_\theta$ to the negative log-likelihood

\[L(\theta) = -\sum \log p_\theta(y\mid x) - D_{KL}(u\Vert p_\theta(y\mid x))\]

Reversing the direction of the KL divergence, i.e., using $D_{KL}(p_\theta(y\mid x)\Vert u)$ instead, recovers the confidence penalty up to an additive constant, since $H(p_\theta(y\mid x)) = \log K - D_{KL}(p_\theta(y\mid x)\Vert u)$ for $K$ classes.
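
For comparison, a minimal sketch of uniform label smoothing (the smoothing weight `eps` and the function name are my own); expanding the smoothed cross-entropy shows it equals a reweighted negative log-likelihood plus $\varepsilon\, D_{KL}(u\Vert p_\theta)$ up to an additive constant:

```python
import torch.nn.functional as F

def label_smoothing_loss(logits, targets, eps=0.1):
    """Cross-entropy against targets smoothed toward the uniform distribution
    q = (1 - eps) * one_hot(y) + eps * u; up to a constant this adds
    eps * KL(u || p_theta) to a reweighted negative log-likelihood."""
    log_probs = F.log_softmax(logits, dim=-1)
    nll = F.nll_loss(log_probs, targets)           # -log p_theta(y | x), averaged
    uniform_ce = -log_probs.mean(dim=-1).mean()    # cross-entropy between u and p_theta
    return (1.0 - eps) * nll + eps * uniform_ce
```

Recent versions of PyTorch also expose this directly via `F.cross_entropy(logits, targets, label_smoothing=eps)`.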
