Fantastic Generalization Measures and Where to Find Them
The post is based on Jiang, Y., Neyshabur, B., Mobahi, H., Krishnan, D., & Bengio, S. (2019). Fantastic Generalization Measures and Where to Find Them. arXiv:1912.02178 [cs, stat], which was shared by one of my friends on WeChat Moments; I took a quick look at it.
Because the generalization of deep networks has attracted great interest in recent years, a number of theoretically and empirically motivated complexity measures have been proposed.
open question: whether the conclusions drawn from those experiments would remain valid in other settings
The paper:
- presents the first large-scale study of generalization in deep networks
- investigates more than 40 complexity measures taken from both theoretical bounds and empirical studies
- trains over 10,000 convolutional networks by systematically varying commonly used hyperparameters
- carries out carefully controlled experiments and shows surprising failures of some measures as well as promising measures
Introduction
generalization bound: an upper bound on the test error based on some quantity that can be calculated on the training set.
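To make this concrete, a generic bound of this kind (my own schematic notation, not taken from the paper) looks like

$$ L_{\mathcal{D}}(f) \;\le\; \hat{L}_S(f) + \sqrt{\frac{\mathcal{C}(f)}{m}}, $$

where $L_{\mathcal{D}}(f)$ is the expected test error, $\hat{L}_S(f)$ is the training error on a sample $S$ of size $m$, and $\mathcal{C}(f)$ is some complexity term (e.g., VC-dimension or a norm of the weights).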
Although PAC-Bayesian bounds can be optimized to achieve reasonably tight generalization bounds, current bounds are still not tight enough to accurately capture the generalization behavior.
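For reference, a classical PAC-Bayesian bound (a standard textbook form, ignoring exact constants; not quoted from the paper) controls the expected test loss of a stochastic predictor drawn from a posterior $Q$ via its training loss and its KL divergence from a prior $P$: with probability at least $1-\delta$ over the $m$ training samples,

$$ \mathbb{E}_{w \sim Q}\big[L_{\mathcal{D}}(w)\big] \;\le\; \mathbb{E}_{w \sim Q}\big[\hat{L}_S(w)\big] + \sqrt{\frac{\mathrm{KL}(Q \,\|\, P) + \ln\frac{m}{\delta}}{2(m-1)}}. $$

Optimizing $Q$ (e.g., the variance of Gaussian weight perturbations) to tighten such a bound is what connects PAC-Bayes to the sharpness measures discussed below.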
Others propose direct empirical ways to characterize generalization, but empirical correlation does not necessarily translate to a causal relationship between a measure and generalization.
complexity measure: a quantity that monotonically relates to some aspect of generalization; lower complexity should often imply a smaller generalization gap.
- theoretically motivated complexity measures: VC-dimension, norms of parameters (see the sketch below)
- empirically motivated complexity measures: sharpness
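As a minimal illustration of the norm-based family, the sketch below computes the squared $\ell_2$ norm of all weights of a PyTorch model as a toy complexity measure. This is my own simplification; the measures evaluated in the paper involve margins, per-layer products, and other refinements.

```python
import torch
import torch.nn as nn

def l2_norm_measure(model: nn.Module) -> float:
    """Toy complexity measure: squared l2 norm of all parameters.

    Lower values are supposed to indicate a 'simpler' network and hence,
    ideally, a smaller generalization gap.
    """
    return sum(p.norm() ** 2 for p in model.parameters()).item()

# Usage on a small (hypothetical) convolutional network for 32x32 RGB inputs:
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3), nn.ReLU(),
    nn.Flatten(), nn.Linear(16 * 30 * 30, 10),
)
print(l2_norm_measure(model))
```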
Despite the prominent role of complexity measures in studying generalization, the empirical evaluation of these measures is usually limited to a few models, often on toy problems. A measure can only be considered a reliable predictor of the generalization gap if it is tested extensively on many models at a realistic problem size.
The paper selected a wide range of complexity measures from the literature:
- measures motivated by generalization bounds related to VC-dimension, norm- or margin-based bounds, and PAC-Bayesian bounds
- measures such as sharpness, Fisher-Rao norm, and path norms (a crude sharpness sketch follows this list)
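To give a feel for the empirically motivated family, here is a crude sketch of a sharpness-style measure: perturb the weights with small Gaussian noise and record how much the training loss increases. The function name and parameters (sigma, n_samples) are my own; the paper's sharpness and PAC-Bayesian measures instead search for the largest perturbation magnitude that keeps the loss change within a target level.

```python
import copy
import torch

def sharpness(model, loss_fn, inputs, targets, sigma=0.01, n_samples=5):
    """Crude sharpness estimate: average increase in training loss when the
    weights are perturbed by Gaussian noise of scale sigma.

    Flat minima (small increase) are expected to generalize better than
    sharp ones (large increase).
    """
    with torch.no_grad():
        base_loss = loss_fn(model(inputs), targets).item()
        increases = []
        for _ in range(n_samples):
            perturbed = copy.deepcopy(model)
            for p in perturbed.parameters():
                p.add_(sigma * torch.randn_like(p))
            increases.append(loss_fn(perturbed(inputs), targets).item() - base_loss)
    return sum(increases) / n_samples
```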
The paper trains more than 10,000 models on two image classification datasets, CIFAR-10 and Street View House Numbers (SVHN).
Key findings:
- many norm-based measures not only perform poorly, but correlate negatively with the generalization gap (i.e., larger complexity goes with a smaller gap), specifically when the optimization procedure injects some stochasticity
- sharpness-based measures, such as PAC-Bayesian bounds and the sharpness measure, perform the best overall and seem to be promising candidates for further research
- measures related to the optimization procedure, such as the gradient noise and the speed of optimization, can be predictive of generalization
Generalization
The core question in generalization is what causes the triplet of a model, an optimization algorithm, and data properties to generalize well beyond the training set.
Some potential approaches to compare different complexity measures:
- tightness of generalization bounds
- regularizing the complexity measure
- correlation with generalization:
  - another pitfall: drawing conclusions from changing only one or two hyperparameters. In such cases, the hyperparameter could be the true cause of both the change in the measure and the change in generalization, while the measure itself has no causal relationship with generalization (a confounding effect)
The paper focuses on the third approach. While acknowledging all the limitations of a correlation analysis, the authors try to improve the procedure and capture causal effects as much as possible through a careful design of controlled experiments.
The goal is to analyze how consistent a measure (such as the $\ell_2$ norm of the network weights) is with the empirically observed generalization.
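Concretely, the paper's evaluation relies on rank correlation (Kendall's $\tau$ and conditional variants) between a measure and the observed generalization gap across many trained models. The toy sketch below, with made-up numbers purely for illustration, shows what that comparison looks like.

```python
from scipy.stats import kendalltau

# Hypothetical values of a complexity measure (e.g., an l2 norm) and the
# generalization gap for five trained models; the numbers are made up.
measure = [12.3, 8.1, 15.7, 9.9, 20.4]
gen_gap = [0.08, 0.05, 0.11, 0.06, 0.14]

# A useful measure should rank models in the same order as their gap;
# Kendall's tau quantifies this rank agreement (+1 = perfect agreement).
tau, _ = kendalltau(measure, gen_gap)
print(f"Kendall's tau between measure and generalization gap: {tau:.2f}")
```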