WeiYa's Work Yard

A traveler with endless curiosity, who fell into the ocean of statistics, tries to write down his ideas and notes to save himself.

Flexible Imbalance-Fairness-Aware Classifiers

Tags: Large Language Model, Fairness

This note is for Deng, Z., Zhang, J., Zhang, L., Ye, T., Coley, Y., Su, W. J., & Zou, J. (2022). FIFA: Making Fairness More Generalizable in Classifiers Trained on Imbalanced Data (No. arXiv:2206.02792). arXiv.

empirically, this imbalance leads to a lack of generalizability not only of classification, but also of fairness properties, especially in over-parameterized models

the paper proposes a theoretically-principled, yet Flexible approach that is Imbalance-Fairness-Aware (FIFA)

sufficiently trained ResNet-10 models generalize well on classification error but poorly on fairness constraints

algorithmic fairness, where practical algorithms are developed for pre-processing, in-processing, and post-processing steps.

2 Background

consider datasets consisting of triplets of the form $(x, y, a)$, where $x\in \cX$ is a feature vector, $a\in \cA$ is a sensitive attribute such as race and gender, $y\in \cY$ is the corresponding label.

the underlying random triplets corresponding to $(x, y, a)$ is denoted as $(X, Y, A)$

the goal is to learn a predictor $h: \cX\to \cY$ from a hypothesis class $\cH$, where $h(X)$ is a prediction of the label $Y$ of input $X$

this paper mainly considers equalized odds (EO); the method can also be directly applied to equalized opportunity (EqOpt)

in addition, under certain conditions, the method can also be applied to demographic parity (DP)

  • equalized odds (EO) and equalized opportunity (EqOpt): a predictor $h$ satisfies equalized odds if $h(X)$ is conditionally independent of the sensitive attribute $A$ given $Y$, i.e., for all $y\in \cY$ and $a\in \cA$,
\[P(h(X) = y\mid Y=y, A=a) = P(h(X) = y\mid Y = y)\]

if $\cY = \{0, 1\}$, and we only require

\[P(h(X) = 1\mid Y=1, A=a) = P(h(X)=1\mid Y=1)\,,\]

we say $h$ satisfies equalized opportunity.

  • demographic parity (DP): a predictor $h$ satisfies demographic parity if $h(X)$ is statistically independent of the sensitive attribute $A$: $P(h(X) = y\mid A = a) = P(h(X) = y)$ for all $y\in\cY$ and $a\in\cA$
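as a sanity check on these definitions, the empirical EO and DP gaps can be computed directly from predictions; the helpers below (the names `eo_violation` and `dp_violation` are mine, not from the paper) report the largest gap across labels and groups:

```python
import numpy as np

def eo_violation(y_pred, y_true, a):
    """Empirical equalized-odds gap: for each label y, the spread of
    P(h(X)=y | Y=y, A=g) across groups g; returns the largest spread."""
    gaps = []
    for y in np.unique(y_true):
        rates = [np.mean(y_pred[(y_true == y) & (a == g)] == y)
                 for g in np.unique(a)]
        gaps.append(max(rates) - min(rates))
    return max(gaps)

def dp_violation(y_pred, a):
    """Empirical demographic-parity gap: spread of P(h(X)=y | A=g)
    across groups g, for each predicted label y."""
    gaps = []
    for y in np.unique(y_pred):
        rates = [np.mean(y_pred[a == g] == y) for g in np.unique(a)]
        gaps.append(max(rates) - min(rates))
    return max(gaps)
```

a perfect classifier on the training set drives both empirical gaps to zero, which is the starting observation of Section 3.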

3 Theory-inspired derivation

consider the supervised $k$-class classification problem, where a model $f: \cX\mapsto \IR^k$ provides $k$ scores, and the predicted label is the class with the highest score:

\[h(x) = \argmax_i f(x)_i\]

use

\[\cP_i = \cP(X\mid Y = i)\]

to denote the conditional distribution of $X$ when the class label is $i$, for $i\in[k]$

$\cP_{bal}$: the balanced distribution $\cP_{Idx}$, where $\text{Idx}$ is uniformly drawn from $[k]$.

$\cP_{i, s} = \cP(X\mid Y=i, A=s)$: conditional distribution when $Y = i$ and $A = s$

for the training dataset $\{(x_j, y_j, a_j)\}_j$, let $S_i = \{j: y_j=i\}$, $S_{i, a} = \{j: y_j=i, a_j=a\}$, and the corresponding sample sizes be $n_i$ and $n_{i, a}$, respectively.

the goal is to make the balanced error

\[\cL_{bal}[f] =\bbP_{(x, y)\sim \cP_{bal}}[f(x)_y < \max_{l\neq y} f(x)_l]\]

and the fairness violation error as small as possible
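the empirical counterpart of $\cL_{bal}[f]$ is just the average of per-class error rates, since drawing the class index uniformly from $[k]$ weights each class equally; a minimal sketch (helper name is mine):

```python
import numpy as np

def balanced_error(y_pred, y_true):
    """Empirical balanced error: average of per-class error rates,
    i.e. the error under P_bal where Idx is uniform over the classes."""
    classes = np.unique(y_true)
    errs = [np.mean(y_pred[y_true == c] != c) for c in classes]
    return float(np.mean(errs))
```

on an imbalanced dataset this can differ sharply from the plain error rate: a classifier that ignores a small minority class can have low overall error but high balanced error.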

margin trade-off between classes under equalized odds

in the setting of standard classification with imbalanced training datasets, the aim is to reach a small balanced test error $\cL_{bal}[f]$

but, in a fair classification setting, the aim is not only to reach a small $\cL_{bal}[f]$, but also to satisfy certain fairness constraints at test time. Specifically, for EO, the aim is

\[\min_f \cL_{bal}[f]\\ \text{s.t. }\forall y\in \cY, a\in \cA, P(h(X) = y\mid Y=y, A=a) = P(h(X) = y\mid Y = y)\]

where recall that $h(\cdot) = \argmax_i f(\cdot)_i$

their performance criterion for optimization is

\[\cL_{bal}[f] + \alpha \cL_{fv}\]

where $\cL_{fv}$ is a measure of fairness constraints violation

start with $\cY = \{0, 1\}$ and $\cA = \{a_1, a_2\}$.

if all the training samples are classified perfectly by $h$, then not only

\[\bbP_{(x, y)\sim \hat\cP_{bal}}(h(x)\neq y) =0\]

holds, but also

\[\bbP_{(x, y)\sim \hat\cP_{i, a_j}}(h(x)\neq y) = 0\]

for all $i\in\cY$ and $a_j\in \cA$.

note that

\[P(h(X) = i\mid Y = i, A=a) = 1- \bbP_{(x, y)\sim \cP_{i, a}}(h(x)\neq y)\]

denote the margin for class $i$ by

\[\gamma_i = \min_{j\in S_i}\gamma(x_j, y_j)\,,\]

where

\[\gamma(x, y) = f(x)_y - \max_{l\neq y} f(x)_l\]
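these margin definitions translate directly into code; a small sketch computing per-sample margins $\gamma(x, y)$ and the class margin $\gamma_i$ (helper names are mine):

```python
import numpy as np

def sample_margins(scores, y):
    """gamma(x, y) = f(x)_y - max_{l != y} f(x)_l for each sample,
    given an (n, k) score matrix and integer labels y."""
    scores = np.asarray(scores, dtype=float)
    y = np.asarray(y)
    n = len(y)
    true_scores = scores[np.arange(n), y]
    masked = scores.copy()
    masked[np.arange(n), y] = -np.inf  # exclude the true class from the max
    return true_scores - masked.max(axis=1)

def class_margin(scores, y, i):
    """gamma_i = min over samples of class i of their margins."""
    m = sample_margins(scores, y)
    return m[np.asarray(y) == i].min()
```

a positive class margin $\gamma_i$ means every sample of class $i$ is classified correctly with some slack.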

one natural way to choose $\cL_{fv}$ is to take

\[\sum_{i\in \cY}\vert P(h(X) = i\mid Y=i, A=a_1) - P(h(X) = i\mid Y = i, A=a_2)\vert\]

Thm 3.1 shows how the sample size of each subgroup is taken into account and how it affects the optimal ratio between class margins.

4 Flexible combination with logits-based losses

for each class $i$ and group $a$, decompose the subgroup margin as

\[\gamma_{i, a} = \gamma_i + \delta_{i, a}, \qquad \delta_{i, a} \ge 0\]

consider a logits-based loss

\[\ell((x, y); f) = \ell(f(x)_y, \{f(x)_i\}_{i\in \cY\backslash y})\]

which is non-increasing with respect to its first coordinate if one fixes the second coordinate. Such losses include

  • 0-1 loss
  • Hinge loss
  • Softmax-cross-entropy

the flexible imbalance-fairness-aware (FIFA) approach modifies the above losses by enforcing margins:

\[\ell_{FIFA}((x, y, a); f) = \ell(f(x)_y - \Delta_{y, a}, \{f(x)_i\}_{i\in \cY\backslash y})\]
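a hedged sketch of this modification applied to softmax cross-entropy: the enforced margin $\Delta_{y,a}$ is subtracted from the true-class logit before the usual loss, in the spirit of LDAM-style margin losses; the dictionary encoding of $\Delta$ and the margin values in the test are my illustrative choices, not the paper's tuned margins:

```python
import numpy as np

def fifa_cross_entropy(scores, y, a, delta):
    """Margin-adjusted softmax cross-entropy: subtract Delta_{y,a}
    from the true-class logit, then apply standard cross-entropy.
    `delta` maps (label, group) pairs to enforced margins."""
    scores = np.asarray(scores, dtype=float).copy()
    y = list(y)
    n = len(y)
    for j in range(n):
        scores[j, y[j]] -= delta.get((y[j], a[j]), 0.0)
    # numerically stable log-softmax
    z = scores - scores.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(n), y].mean())
```

with all margins zero this reduces to the standard softmax cross-entropy; a positive $\Delta_{y,a}$ makes samples of subgroup $(y, a)$ harder to fit, which pushes the learned decision boundary further away from that subgroup.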

5 Example: combining FIFA with reductions-based fair algorithms

  • Exponentiated gradient (ExpGrad) that produces a randomized classifier
  • Grid search (GridS) that produces a deterministic classifier

