
Sample Difficulty from Pre-trained Models

Tags: Pre-trained Models, Sample Difficulty

This note is for Cui, P., Zhang, D., Deng, Z., Dong, Y., & Zhu, J. (2023). Learning Sample Difficulty from Pre-trained Models for Reliable Prediction. Advances in Neural Information Processing Systems, 36, 25390–25408.

Learning Sample Difficulty from Pre-trained Models for Reliable Prediction

challenges:

  • how to leverage large-scale pre-trained models to improve the prediction reliability of downstream models is undesirably under-explored.
  • modern neural networks have been found to be poorly calibrated and make overconfident predictions regardless of inherent sample difficulty and data uncertainty

to address the issue, the paper proposes to utilize large-scale pre-trained models to guide downstream model training with sample difficulty-aware entropy regularization

pre-trained models that have been exposed to large-scale datasets and do not overfit the downstream training classes enable measuring each training sample's difficulty via feature-space Gaussian modeling and relative Mahalanobis distance computation.

by adaptively penalizing overconfident prediction based on the sample difficulty, they simultaneously improve accuracy and uncertainty calibration across challenging benchmarks

Introduction

the community has reached a consensus that, by exploiting the data, pre-trained models can learn to encode rich data semantics that promise to be generally beneficial for a broad spectrum of applications, e.g.,

  • warming up the learning on downstream tasks with limited data
  • improving domain generalization or model robustness
  • enabling zero-shot transfer

the paper investigates a new application:

  • leveraging pre-trained models to improve the calibration and the quality of uncertainty quantification of downstream models, both of which are crucial for reliable model deployment in the wild

model training on the task dataset often encounters ambiguous or even distorted samples.

  • these samples are difficult to learn from: directly enforcing the model to fit them may cause undesirable memorization and overconfidence

this issue comes from loss formulation and data annotation, thus using pre-trained models for finetuning or knowledge distillation with the cross-entropy loss will NOT solve the problem

it is necessary to subtly adapt the training objective for each sample according to its sample difficulty for better generalization and uncertainty quantification

the paper exploits pre-trained models to measure the difficulty of each sample in the downstream training set and uses this annotation to modify the training loss

sample difficulty has been shown to be beneficial to improve training efficiency and neural scaling laws, but its effectiveness for ameliorating model reliability is largely under-explored

pre-trained models aid in scoring sample difficulty by shifting the problem from the raw data space to a task- and model-agnostic feature space, where simple distance measures suffice to represent similarities

large-scale multi-modal datasets and self-supervised learning principles enable the pre-trained models to generate features that sufficiently preserve high-level concepts behind the data and avoid overfitting to specific data or classes

the paper performs sample difficulty estimation in the feature space of pre-trained models and casts it as a density estimation problem, since samples with typical discriminative features are easier to learn and typical features shall reappear

  • the paper advocates using a Gaussian model to represent the training data distribution in the feature space of pre-trained models
  • and derives the relative Mahalanobis distance (RMD) as a sample difficulty score

Equipped with the knowledge of sample difficulty, the paper proposes to use it for regularizing the prediction confidence

(Figure: easy and hard samples of the class Tusker)

  • the standard cross entropy loss with a single ground truth label would encourage the model to predict all instances as the class Tusker with 100% confidence
  • however, such high confidence is definitely unjustified for the hard samples
  • thus, the paper modifies the training loss by adding an entropic regularizer with an instance-wise adaptive weighting in proportion to the sample difficulty

the method successfully improves the model’s performance and consistently outperforms competitive baselines on various image classification benchmarks, ranging from the i.i.d. setting to corruption robustness, selective classification, and out-of-distribution detection.

unlike previous works that compromise accuracy or suffer from expensive computational overhead, the method can improve predictive performance and uncertainty quantification concurrently in a computationally efficient way

the consistent gains across architectures demonstrate that the sample difficulty measure is a valuable characteristic of the dataset for training

Uncertainty regularization

uncertainty regularization aims to alleviate the overfitting and overconfidence of deep neural networks.

  • $L_p$ norm and entropy regularization (ER) explicitly enforce a small norm of the logits or a high predictive entropy
  • label smoothing (LS) interpolates the ground-truth one-hot label vector with the uniform distribution (a scaled all-one vector), thus penalizing 100% confident predictions; see the sketch after this list
  • focal loss (FL) has been shown to be an upper bound of ER
  • correctness ranking loss (CRL) regularizes the confidence based on the frequency of correct predictions during training
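
for reference, a minimal PyTorch sketch of ER and LS as training losses (my own illustration, not code from the paper; recent PyTorch also exposes a `label_smoothing` argument in `F.cross_entropy`):

```python
import torch
import torch.nn.functional as F

def entropy_regularized_ce(logits, targets, lam=0.1):
    """ER: cross entropy minus a weighted predictive entropy."""
    log_probs = F.log_softmax(logits, dim=-1)
    ce = F.nll_loss(log_probs, targets)
    # predictive entropy H(p) = -sum_k p_k log p_k, averaged over the batch
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1).mean()
    return ce - lam * entropy  # subtracting entropy penalizes overconfidence

def label_smoothing_ce(logits, targets, eps=0.1):
    """LS: cross entropy against a target smoothed toward the uniform distribution."""
    n_classes = logits.size(-1)
    log_probs = F.log_softmax(logits, dim=-1)
    smoothed = torch.full_like(log_probs, eps / n_classes)
    smoothed.scatter_(1, targets.unsqueeze(1), 1.0 - eps + eps / n_classes)
    return -(smoothed * log_probs).sum(dim=-1).mean()
```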

LS, $L_p$ norm, FL and ER do not adjust prediction confidence based on the sample difficulty

the frequency of correct predictions can be interpreted as a sample difficulty measure; however, CRL only concerns pair-wise ranking within the same batch.

Compared to the prior art, the paper weights the overconfidence penalty according to the sample difficulty, which proves to be more effective at jointly improving accuracy and uncertainty quantification.

the sample difficulty is derived from a data distribution perspective, and remains constant over training

Sample difficulty measurement

  • as deep neural networks are prone to overfitting, they often require careful selection of training epochs, checkpoints, data splits and ensembling strategies

the paper explores pre-trained models that have been exposed to large-scale datasets and do not overfit the downstream training set; they provide informative prior knowledge for ranking downstream training samples

the attained sample difficulty can be readily used for different downstream tasks and models

the paper focuses on improving prediction reliability, which is an under-explored application of large-scale pre-trained models.

the derived relative Mahalanobis distance (RMD) outperforms Mahalanobis distance (MD) and K-means clustering for sample difficulty quantification

Sample Difficulty Quantification

the sample difficulty quantification (i.e., characterizing the hardness and noisiness of samples) is pivotal for reliable learning of the model

for measuring training sample difficulty, the paper proposes to model the data distribution in the feature space of a pre-trained model and derive an RMD

a small RMD implies that

  1. the sample is typical and carries class-discriminative features (close to the class-specific mean mode but far away from the class-agnostic mean mode)
  2. there exist many similar samples (high-density area) in the training set

Such a sample represents an easy case to learn, i.e., small RMD <-> low sample difficulty

Large-scale pre-trained models

  • large-scale image and image-text data have led to high-quality pre-trained vision models for downstream tasks
  • instead of using them as the backbone network for downstream tasks, the paper proposes a new use case, i.e., scoring the sample difficulty in the training set of the downstream task
  • easy-to-learn samples shall reappear in the form of similar patterns
  • repetitive patterns specific to each class are valuable cues for classification. They contain neither confusing nor conflicting information.
  • single-label images containing multiple salient objects belonging to different classes or having wrong labels would be hard samples

to quantify the difficulty of each sample, the paper proposes to model the training data distribution in the feature space of large-scale pre-trained models

  • in the pixel space, data distribution modeling is prone to overfitting low-level features
    • an outlier sample with smoother local correlation can have a higher probability than an inlier sample
  • on the other hand, pre-trained models are generally trained to ignore low-level information, e.g., via semantic supervision from natural language or class labels

in the case of self-supervised learning, the proxy task and loss are also formulated to learn a holistic understanding of the input images beyond low-level image statistics,

  • the masking strategy designed in MAE prevents reconstruction via exploiting local correlation

as modern pre-trained models are trained on large-scale datasets with high sample diversities in many dimensions,

  • they learn to preserve and structure richer semantic features of the training samples than models exposed only to the (commonly much smaller) downstream training set

compared to classical LDA, for me, the most interesting part is the usage of the pre-trained model!!

in the feature space of pre-trained models, one expects easy-to-learn samples will be closely crowded together

hard-to-learn ones are far away from the population and even sparsely spread, owing to the lack of consistently repetitive patterns.

from a data distribution perspective, the easy (hard)-to-learn samples should have high (low) probability values.

Measuring sample difficulty

Class-conditional and agnostic Gaussian distributions

  • downstream training set $\cD = \{(x_i, y_i)\}_{i=1}^N$: a collection of image-label pairs
    • $x_i \in \IR^d$ and $y_i \in \{1, \ldots, K\}$

model the feature distribution of $\{x_i\}$ with and without conditioning on the class information

  • $G(\cdot)$ denotes the penultimate layer output of the pre-trained model $G$
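
for concreteness, a sketch of extracting such penultimate-layer features; the torchvision ResNet-50 below is only a stand-in backbone of my choosing (the paper works with large-scale pre-trained models such as MAE-ViT):

```python
import torch
import torch.nn as nn
import torchvision.models as models

# stand-in backbone; the paper instead uses large-scale pre-trained models (e.g., MAE-ViT)
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
backbone.eval()
# drop the final classification layer so the model outputs penultimate-layer features G(.)
G = nn.Sequential(*list(backbone.children())[:-1], nn.Flatten(1))

@torch.no_grad()
def extract_features(loader):
    feats, labels = [], []
    for x, y in loader:          # loader yields (image batch, label batch)
        feats.append(G(x))
        labels.append(y)
    return torch.cat(feats), torch.cat(labels)
```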

the class-conditional distribution is modelled by fitting a Gaussian model to the feature vectors $\{G(x_i)\}$ belonging to the same class $y_i = k$

\[P(G(x)\mid y = k) = N(G(x) \mid \mu_k, \Sigma)\] \[\mu_k = \frac{1}{N_k}\sum_{i:y_i=k}G(x_i)\,, \quad \Sigma = \frac 1N\sum_k \sum_{i:y_i = k}(G(x_i) - \mu_k)(G(x_i) - \mu_k)^\top\]

the class-agnostic distribution is obtained by fitting to all feature vectors regardless of their classes

\[P(G(x)) = N(G(x)\mid \mu_{agn}, \Sigma_{agn})\] \[\mu_{agn} = \frac 1N\sum_{i=1}^N G(x_i)\,,\quad \Sigma_{agn} = \frac 1N\sum_{i=1}^N (G(x_i) - \mu_{agn})(G(x_i) - \mu_{agn})^\top\]
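
a minimal NumPy sketch of the two Gaussian fits above, where `feats` is the $N \times d$ matrix of features $G(x_i)$ and `labels` holds the class indices (my own illustration):

```python
import numpy as np

def fit_gaussians(feats, labels, num_classes):
    """Fit class-conditional Gaussians (shared covariance) and a class-agnostic one."""
    n, d = feats.shape
    # per-class means mu_k
    mu = np.stack([feats[labels == k].mean(axis=0) for k in range(num_classes)])
    # shared covariance Sigma, pooled over all classes
    centered = feats - mu[labels]
    sigma = centered.T @ centered / n
    # class-agnostic mean and covariance
    mu_agn = feats.mean(axis=0)
    centered_agn = feats - mu_agn
    sigma_agn = centered_agn.T @ centered_agn / n
    return mu, sigma, mu_agn, sigma_agn
```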

Relative Mahalanobis distance (RMD)

for scoring sample difficulty, the paper proposes to use the difference between the Mahalanobis distances respectively induced by the class-specific and the class-agnostic Gaussian distributions

\[RM(x_i, y_i) = M(x_i, y_i) - M_{agn}(x_i)\] \[M(x_i, y_i) = (G(x_i) - \mu_{y_i})^\top\Sigma^{-1}(G(x_i) - \mu_{y_i})\] \[M_{agn}(x_i) = (G(x_i) - \mu_{agn})^\top\Sigma_{agn}^{-1}(G(x_i) - \mu_{agn})\]
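
continuing the NumPy sketch, the corresponding RMD computation; the small ridge added before inverting the covariances is my own numerical-stability tweak:

```python
def rmd_scores(feats, labels, mu, sigma, mu_agn, sigma_agn, ridge=1e-6):
    """RM(x_i, y_i) = M(x_i, y_i) - M_agn(x_i); smaller scores = easier samples."""
    d = feats.shape[1]
    prec = np.linalg.inv(sigma + ridge * np.eye(d))
    prec_agn = np.linalg.inv(sigma_agn + ridge * np.eye(d))
    diff = feats - mu[labels]
    m_cond = np.einsum("nd,de,ne->n", diff, prec, diff)             # class-conditional MD
    diff_agn = feats - mu_agn
    m_agn = np.einsum("nd,de,ne->n", diff_agn, prec_agn, diff_agn)  # class-agnostic MD
    return m_cond - m_agn
```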

a small class-conditional MD $M(x_i, y_i)$ indicates that the sample exhibits typical features of the sub-population. However, some features may not be unique to the sub-population, i.e., common features across classes, yielding a small class-agnostic MD $M_{agn}(x_i)$

an easier-to-learn sample should have small class-conditional MD but large class-agnostic MD. the derived RMD is thus an improvement over the class-conditional MD for measuring the sample difficulty

How well RMD scores the sample difficulty

hard samples tend to be challenging to classify, as they lack relevant information or contain ambiguous information

prior works also show that data uncertainty should increase on poor-quality images

the paper manipulates the sample difficulty by corrupting the input image or altering the label

RMD score increases proportionally with the severity of corruption and label noise, which further shows the effectiveness of RMD in characterizing the hardness and noiseness of samples

the paper uses RMD to sort each ImageNet1k validation sample in descending order of sample difficulty and groups them into subsets of equal size
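
a small sketch of that grouping, reusing `rmd_scores` from above (the number of groups is an illustrative choice):

```python
# sort validation samples from hardest to easiest, then split into equal-sized groups
scores = rmd_scores(feats, labels, mu, sigma, mu_agn, sigma_agn)
order = np.argsort(-scores)          # descending sample difficulty
subsets = np.array_split(order, 10)  # 10 groups is an illustrative choice
```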


the paper hypothesizes that self-supervised learning (MAE-ViT-B) plays a positive role compared to supervised learning (ViT-B), as it does not overfit the training classes and instead learns a holistic understanding of the input images beyond low-level image statistics, aided by a well-designed loss

Difficulty-aware Uncertainty Regularization

the paper proposes a sample difficulty-aware entropy regularization to improve the training loss, which will be shown to yield more reliable predictions

  • $f_\theta$: the classifier parameterized by a deep neural network, which outputs a conditional probability distribution $p_\theta(y\mid x)$
  • train $f_\theta$ by minimizing the cross-entropy loss on the training set
\[\ell =\bbE_{(x_i, y_i)}[-\log(f_\theta(x_i)[y_i])]\]

where $f_\theta(x_i)[y_i]$ refers to the predictive probability of the ground-truth label $y_i$

  • deep neural networks trained with cross-entropy loss tend to make overconfident predictions
  • besides, the ground-truth label is typically a one-hot vector which represents the highest confidence regardless of the sample difficulty

although different regularization-based methods have been proposed to address the issue, they all ignore the difference between easy and hard samples

  • and may assign an inappropriate regularizer for some samples
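
to make the proposal concrete, a hedged sketch of what such a difficulty-aware training loss could look like; the min-max normalization of the RMD scores into per-sample weights is my own assumption, not necessarily the exact weighting used in the paper:

```python
import torch.nn.functional as F

def difficulty_aware_loss(logits, targets, rmd, lam=0.5):
    """Cross entropy plus a per-sample, difficulty-weighted entropy bonus.

    rmd: precomputed RMD scores for the batch (larger = harder); the min-max
    normalization into weights w in [0, 1] is an illustrative assumption.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    ce = F.nll_loss(log_probs, targets, reduction="none")
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1)
    w = (rmd - rmd.min()) / (rmd.max() - rmd.min() + 1e-8)
    # harder samples receive a stronger push toward higher predictive entropy
    return (ce - lam * w * entropy).mean()
```

compared to plain ER (a constant weight for every sample), the per-sample weight lets easy samples keep confident targets while hard samples are pushed toward higher predictive entropy.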
