
Fair Risk Control for Calibrating Multi-group Fairness Risks

Tags: Large Language Model, Fairness

This note is for Zhang, L., Roth, A., & Zhang, L. (2024). Fair Risk Control: A Generalized Framework for Calibrating Multi-group Fairness Risks. Proceedings of the 41st International Conference on Machine Learning, 59783–59805.

based on the notion of multi-calibration, introduce $(s, \cG, \alpha)$-GMC (Generalized Multi-Dimensional Multicalibration) for a multi-dimensional mapping $s$, a constraint set $\cG$, and a pre-specified threshold level $\alpha$

  • propose associated algorithms to achieve this notion in general settings

this framework is then applied to diverse scenarios encompassing different fairness concerns, including

  • false negative rate control in image segmentation
  • prediction set conditional uncertainty quantification in hierarchical classification
  • de-biased text generation in language models

Introduction

  • a common theme across the fairness literature is that some measure of error or risk should be equalized across sub-populations
  • common measures evaluated across demographic groups include false positive and false negative rates and calibration error
  1. initial work in this line gave methods for equalizing different risk measures on disjoint groups
  2. a second generation of work gave methods for equalizing measures of risk across groups even when the groups could intersect

in general, distinct algorithms are derived for each of these settings, and they are generally limited to one-dimensional predictors of various sorts

this work proposes a unifying framework for fair risk control in settings with multi-dimensional outputs, based on multicalibration

  • multicalibration was introduced as a fairness-motivated constraint that informally asks that a 1-dimensional predictor of a binary-valued outcome be unbiased, conditional on both its own prediction and on membership of the input in some number of pre-defined groups
  • later, multicalibration has been generalized in a number of ways
    • generalizes multicalibration to real-valued outcomes, and defines and studies a variant of multicalibration that predicts variance and higher moments rather than means
    • extends the study of multicalibration of both means and moments to the online setting, and defines a variant of multicalibration for quantiles, with applications to uncertainty estimation
    • gives more practical variants of quantile multicalibration with applications to conformal prediction, together with experimental evaluation
    • gives an abstract generalization of 1-dimensional multicalibration, and shows how to cast other algorithmic fairness desiderata, like false positive rate control, in this framework
    • gives a characterization of the scope of 1-dimensional multicalibration variants via a connection to property elicitation: informally, a property of a distribution can be multicalibrated if and only if it minimizes some 1-dimensional separable regression function
  • another line of work generalizes multicalibration in an orthogonal direction, leaving the outcomes binary valued but generalizing the class of checking rules that are applied
    • defines outcome indistinguishability, which generalizes multicalibration to require indistinguishability between the predicted and true label distributions with respect to a fixed but arbitrary set of distinguishers
    • defines “smooth calibration”, which relaxes calibration’s conditioning event to be a smooth function of the prediction
    • defines a hierarchy of relaxations called low-degree multicalibration that further relaxes smooth calibration and demonstrates desirable statistical properties
    • defines notions of calibration tailored to the objective function of a downstream decision maker

These lines of work are part of a more general literature studying multi-group fairness

Notation

  • $\cX$: feature domain
  • $\cY$: label domain
  • $\cD$: joint (feature, label) data distribution

  • $A$: finite set
  • $\Delta A$: the simplex over $A$, $\Delta A = \{(p_1,\ldots, p_{\vert A\vert}): 0\le p_i\le 1, \sum_{i=1}^{\vert A\vert} p_i = 1\}$

3. Formulation and Algorithm

3.1 A generalized notion of Multicalibration

  • $h(x) \in \cH$ denotes a multi-dimensional scoring function associated with the input
    • e.g., $h(x)\in \IR^k$ ($k$ is the number of pixels) is intended to approximate the probability of a pixel being part of a relevant segment
    • or in text generation tasks, $h(x)$ is the distribution over the vocabulary produced by a language model given context $x$
  • for $x\in \cX$, consider an objective function $f: \cX\rightarrow\cF\subset \IR^m$, defined as $f(x) = (f_1(x), \ldots, f_m(x))$, where $\cF$ is a convex set.
    • e.g., in text generation tasks, $f(x)$ is the calibrated distribution over the output vocabulary and is multi-dimensional (with dimension equal to the vocabulary size)
    • in binary classification tasks where $h$ and $f$ are both scalars, $f(x)$ is the threshold used to convert the raw score $h(x)$ into binary predictions, i.e., $1_{h(x) > f(x)}$
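
To make the shapes concrete, here is a tiny illustration (my own, not from the paper) of what $h(x)$ and $f(x)$ look like in these three examples:

```python
import numpy as np

rng = np.random.default_rng(0)

# Binary classification: h(x) and f(x) are scalars; f(x) acts as a threshold on the raw score.
h_x = 0.72                      # raw score produced by the model
f_x = 0.50                      # calibrated threshold chosen by the procedure
pred = int(h_x > f_x)           # 1_{h(x) > f(x)}

# Image segmentation: h(x) is a vector of k per-pixel scores,
# and the calibrated object is again a threshold applied pixel-wise.
k = 16
h_img = rng.random(k)           # per-pixel scores
mask = (h_img > f_x).astype(int)

# Text generation: h(x) is a distribution over the vocabulary, and f(x) = p(x)
# is the post-processed (calibrated) distribution over the same vocabulary.
vocab_size = 100
h_text = rng.random(vocab_size)
h_text /= h_text.sum()          # model's next-token distribution
p_text = h_text.copy()          # p is initialized at h and then adjusted
```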

write $s(f, x, h, y, \cD): \cQ\times \cX\times \cH\times \cY\times \cP\rightarrow \IR^l$ to denote a mapping functional of interest, where $\cD$ is the joint distribution of $(x, h, y)$ and $\cP$ is the distribution space.

write $\cG$ to denote the class of functions that encode demographic subgroups (along with other information) and for each $g\in \cG, g(f(x), x)\in \IR^l$, consistent with the dimension of $s(f, x, h, y, \cD)$ so that we can calibrate over every dimension of $s$

Let $x, h, y, \cD$ denote the feature vector, the scoring function, the label vector, and the joint distribution of $(x, h, y)$ respectively. Given a function class $\cG$, mapping functional $s$, and a threshold $\alpha > 0$, we say $f$ satisfies $(s, \cG, \alpha)$-Generalized Multicalibration $(s,\cG,\alpha)$-GMC if \(\bbE_{(x, h, y)\sim \cD}[\langle s(f, x, h, y, \cD), g(f(x), x) \rangle] \le \alpha, \forall g\in\cG\)
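
As a sanity check of the definition, here is a minimal sketch (my own, with illustrative names, not the paper's algorithm) of verifying the GMC condition on a finite sample, assuming $\cG$ is a finite collection and the expectation over $\cD$ is replaced by an empirical average:

```python
import numpy as np

def gmc_violation(f, s, G, samples):
    """Largest empirical value of E[<s(f, x, h, y), g(f(x), x)>] over g in G.

    f       : candidate objective function, f(x) -> R^m
    s       : mapping functional, s(f, x, h, y) -> R^l (the distribution D in the
              definition is replaced by the empirical distribution of `samples`)
    G       : finite iterable of group functions g(fx, x) -> R^l
    samples : list of (x, h, y) triples drawn from D
    """
    worst = -np.inf
    for g in G:
        vals = [np.dot(s(f, x, h, y), g(f(x), x)) for (x, h, y) in samples]
        worst = max(worst, float(np.mean(vals)))
    return worst

# f satisfies (s, G, alpha)-GMC on the sample if gmc_violation(f, s, G, samples) <= alpha.
```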

3.2 Algorithms and Convergence Results

3.3 Finite-Sample Results

4 Applications

4.1 De-Biased Text Generation

  • $x\in \cX$: a prompt
  • $h(x)\in\Delta \cY$: language model generates the probability vector
  • sample a word (output) following $o(x)\sim h(x)$

objective: post-process $h(x)$ into a probability distribution $p(x)\in \Delta\cY$ that has better fairness properties in specific ways

  • $p$ is initialized at $h$
  • at a high level, the aim is to produce $p(x)$ so that the probabilities of certain groups of words remain the same whether the prompt includes male-indicating words or female-indicating words
  • an attribute set $U$ is a collection of specific sensitive words
  • $\cU$ is the collection of all such $U$, covering different kinds of sensitive words

measure the bias of the model on sensitive attribute $U$ by

\[\vert P(o(x)\in U\mid x\in F) - P(o(x)\in U\mid x\in M)\vert\]

suppose the marginal distribution over prompts $x$ satisfies $P(x\in F), P(x\in M)\ge \gamma$ for some constant $\gamma > 0$, where $F$ and $M$ denote the sets of prompts containing female-indicating and male-indicating words respectively; then

\begin{align}
& \vert P(o(x)\in U\mid x\in F) - P(o(x)\in U\mid x\in M)\vert \\
& \le \vert P(o(x)\in U\mid x\in F) - P(o(x) \in U)\vert + \vert P(o(x)\in U\mid x\in M) - P(o(x)\in U)\vert \\
& = \frac{1}{P(x\in F)} P(x\in F)\vert P(o(x)\in U\mid x\in F) - P(o(x) \in U)\vert + \frac{1}{P(x\in M)} P(x\in M)\vert P(o(x)\in U\mid x\in M) - P(o(x)\in U)\vert \\
& \le \frac{1}{\gamma} \left( P(x\in F)\vert P(o(x)\in U\mid x\in F) - P(o(x) \in U)\vert + P(x\in M)\vert P(o(x)\in U\mid x\in M) - P(o(x)\in U) \vert \right)
\end{align}

as a result, we only need to control the terms on the right hand side

specifically, we want to calibrate the output so that for any word set $U\in \cU$ (each $U\subset \cY$, e.g., gender-stereotyped professions) and any subgroup $A\in \cA$ (each $A\subset \cX$, e.g., prompts containing gender-related pronouns)

\[\vert P(x\in A)\cdot [P(o(x)\in U\mid x\in A) - P(o(x)\in U)]\vert\le\alpha\]
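
Here is a small sketch (illustrative names, not the paper's code) of estimating this quantity from sampled prompts and generations, given predicates for the subgroup $A$ and the word set $U$:

```python
import numpy as np

def debias_residual(prompts, outputs, in_A, in_U):
    """Empirical |P(x in A) * (P(o(x) in U | x in A) - P(o(x) in U))|.

    prompts : list of prompts x
    outputs : list of sampled words o(x), one per prompt
    in_A    : predicate on prompts (e.g. contains a gender-related pronoun)
    in_U    : predicate on words (e.g. a gender-stereotyped profession)
    """
    a = np.array([in_A(x) for x in prompts], dtype=float)
    u = np.array([in_U(o) for o in outputs], dtype=float)
    p_A = a.mean()
    p_U = u.mean()
    p_U_given_A = (a * u).sum() / max(a.sum(), 1.0)   # guard against an empty subgroup
    return abs(p_A * (p_U_given_A - p_U))

# The calibration goal is debias_residual(...) <= alpha for every subgroup A in the
# collection of subgroups and every U in the collection of sensitive word sets.
```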

4.2 Prediction-Set Conditional Coverage in Hierarchical Classification

in a $K$-class hierarchical text classification problem, the input $x\in \cX$ is a document and the label is a leaf node $y$ on a classification tree with nodes $V$ and edges $E$

  • set $V = \{1,2,\ldots, \vert V\vert\}$
    • the first $K$ indices $\{1,\ldots, K\}$ denote leaf nodes, so the ground-truth label $y\in\{1,\ldots, K\}$
    • the tree is of depth $H$

for a given (single-label) classification model $h: \cX\rightarrow [0, 1]^K$ over the leaf nodes, let $u(x) \triangleq \argmax_k h_k(x)$ denote the candidate with the highest score over all leaf nodes according to $h$

define some useful functional notation

  • $\bA: V\rightarrow V^H$ returns the set of all ancestor nodes of the input node
  • $q: V\times V\rightarrow V$ computes the nearest common ancestor of its two input nodes
  • $R: \cX\rightarrow \IR^{\vert V\vert}$ computes, for each node $i$, $R_i(x)$, the sum of the raw scores $h(x)$ assigned to each leaf that is a descendant of node $i$ (or node $i$ itself if it is a leaf)
  • if randomization is used, let $r_i(x) = R_i(x) + \epsilon_i(x)$, where the $\epsilon_i(x)$ are independent zero-mean random variables with constant variance

define

\[o(x) \triangleq \argmin_v\{d(v): v\in \bA(u(x)), r_v(x) < \lambda(x)\}\]

where

  • $d(v)$ denotes the depth of the node $v$ in the tree
  • $\lambda(x)$ denotes a threshold function
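
Here is a toy sketch of this construction (my own; the parent-array encoding of the tree, the convention that a node counts among its own ancestors, and the fallback when no ancestor qualifies are all assumptions), following the displayed rule literally:

```python
import numpy as np

# Toy tree given by a parent array: parent[v] is v's parent, the root has parent -1.
# Leaves are indexed 0..K-1, internal nodes after them (consistent with the note).
parent = np.array([4, 4, 5, 5, 6, 6, -1])   # 4 leaves {0,1,2,3}, root = 6
K = 4

def depth(v):
    d = 0
    while parent[v] != -1:
        v = parent[v]
        d += 1
    return d

def ancestors(v):
    # The node itself is included among its "ancestors" here; this convention
    # is an assumption and may differ from the paper's.
    out = [v]
    while parent[v] != -1:
        v = parent[v]
        out.append(v)
    return out

def R(h_x):
    # R_i(x): sum of the raw leaf scores h(x) over the leaves descending from node i.
    scores = np.zeros(len(parent))
    for leaf in range(K):
        for v in ancestors(leaf):
            scores[v] += h_x[leaf]
    return scores

def o(h_x, lam, noise_scale=0.0, rng=np.random.default_rng(0)):
    u = int(np.argmax(h_x))                                   # highest-scoring leaf
    r = R(h_x) + noise_scale * rng.standard_normal(len(parent))
    candidates = [v for v in ancestors(u) if r[v] < lam]
    # Rule as displayed above: among such ancestors, take the one of minimal depth.
    # Falling back to the leaf when no ancestor qualifies is an assumption for completeness.
    return min(candidates, key=depth) if candidates else u

h_x = np.array([0.1, 0.6, 0.2, 0.1])
print(o(h_x, lam=0.7))
```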

the goal is to find a $\lambda(x)$ such that the rate at which the output covers the label is roughly equal to a given target $\sigma$, not just overall, but conditional on the prediction we output lying in various sets $U\in\cU\subset 2^V$:

\[\vert \bbE_{(x, h, y)\sim \cD}[1_{o(x)\in U}(\sigma - 1_{o(x)\text{ covers }y})]\vert \le \alpha, \qquad\forall U\in \cU\]

4.3 Fair FNR Control in Image Segmentation

in image segmentation, where $O$ and $O'$ denote the set of actually selected (ground-truth) segments and the set of predicted segments respectively,

\[\text{FNR} = 1 - \frac{\vert O\cap O'\vert}{\vert O\vert}\]
  • $x\in\cX$: the input, which includes both image and demographic group information
  • $y\in \{0, 1\}^m$: the label, which is a binary vector denoting the true inclusion of each of the $m$ pixels

to yield the prediction of $y$, namely $\hat y\in \{0, 1\}^m$, a scoring function $h(x)\in \IR^m$ and a threshold function $\lambda(x)$ are needed, so that

\[\hat y_i = 1_{h_i(x) > \lambda(x)}\]

Let $\cA$ be the set of sensitive subgroups of $\cX$; the goal is to ensure that the FNR across different groups is approximately $1-\sigma$ for some prespecified $\sigma > 0$:

\[\left\vert \bbE_{(x, h, y)\sim \cD}\left[1_{x\in A}\left(1 - \frac{\vert O\cap O'\vert}{\vert O\vert} - (1-\sigma)\right)\right]\right\vert \le \alpha, \qquad\forall A\in \cA\]

this objective can be rewritten as

\[\sup_{A\in \cA}\left\vert\bbE_{(x, h, y)\sim\cD}\left[1_{x\in A}\left(1 - \frac{\sum_{i=1}^m y_i 1_{h_i(x) > \lambda(x)}}{\sum_{i=1}^m y_i} - (1-\sigma)\right)\right]\right\vert \le \alpha\]
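
A minimal sketch (illustrative names, not the paper's code) of checking this group-wise constraint on a sample, plugging in $\hat y_i = 1_{h_i(x) > \lambda(x)}$ and treating $1-\sigma$ as the target FNR:

```python
import numpy as np

def fnr_residual(samples, lam, groups, sigma):
    """sup over groups A of |E[ 1_{x in A} (FNR(x) - (1 - sigma)) ]| on a sample.

    samples : list of (x, h_x, y) with h_x the per-pixel scores (np.array of length m)
              and y the ground-truth mask in {0,1}^m (np.array)
    lam     : threshold function, lam(x) -> scalar
    groups  : list of predicates A(x) -> bool (the sensitive subgroups)
    sigma   : target coverage, so the target FNR is 1 - sigma
    """
    worst = 0.0
    for A in groups:
        residuals = []
        for x, h_x, y in samples:
            y_hat = (h_x > lam(x)).astype(int)               # \hat y_i = 1_{h_i(x) > lam(x)}
            covered = float((y * y_hat).sum()) / max(y.sum(), 1)   # guard against empty masks
            fnr = 1.0 - covered
            residuals.append(float(A(x)) * (fnr - (1.0 - sigma)))
        worst = max(worst, abs(float(np.mean(residuals))))
    return worst

# The fairness goal is fnr_residual(samples, lam, groups, sigma) <= alpha.
```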

5. Experiments

