Statistical Impossibility and Possibility of Aligning LLMs with Human Preferences

Tags: Large Language Model, Alignment

This note is for Liu, K., Long, Q., Shi, Z., Su, W. J., & Xiao, J. (2025). Statistical Impossibility and Possibility of Aligning LLMs with Human Preferences: From Condorcet Paradox to Nash Equilibrium (No. arXiv:2503.10990). arXiv.

Statistical Impossibility and Possibility of Aligning LLMs with Human Preferences: From Condorcet Paradox to Nash Equilibrium

this paper uncovers fundamental statistical limits on aligning LLMs with human preferences, with a focus on the probabilistic representation of human preferences

  • show that human preferences can be represented by a reward model iff the preference among LLM-generated responses is free of any Condorcet cycle.
  • prove that Condorcet cycles exist with probability converging to one exponentially fast under a probabilistic preference model,
    • thereby demonstrating the impossibility of fully aligning human preferences using reward-based approaches such as reinforcement learning from human feedback
  • explore under which conditions LLMs would employ mixed strategies, i.e., generate more than one response with positive probability
    • that is, they do not collapse to a single response when aligned in the limit using a non-reward-based approach, such as Nash learning from human feedback (NLHF)

Introduction

  • LLMs learn to interact with human users and accommodate diverse opinions and values by aligning output with human preferences through RLHF
  • RLHF begins by constructing a reward model trained on preference data from human labelers using the Bradley-Terry-Luce (BTL) model

  • from a statistical perspective, a critical question is whether reward models can sufficiently capture all possible human preference structures

A Statistical Impossibility

identify a fundamental impossibility result regarding the representational capacity of reward models for human preferences

a human preference structure (among multiple responses) cannot be represented by a reward model iff a Condorcet cycle exists

a Condorcet cycle of length $r\ge 3$ consists of $r$ responses $\{y_i\}_{i=1}^r$ such that each $y_i$ is preferred over $y_{i+1}$ by a majority of labelers, with the cyclic convention $y_{r+1}=y_1$

it is evident that a reward-induced preference structure cannot admit Condorcet cycles, because a response with the highest score (assuming no ties) cannot be beaten by any other response; the paper further shows that the existence of a Condorcet cycle is also necessary for a human preference structure to fail to be representable by a reward model

the paper demonstrates that Condorcet cycles emerge under mild conditions, establishing that reward models, including the BTL model, are fundamentally incapable of fully capturing human preferences for LLM alignment.

when OpenAI collected preference data, labelers provided rankings of $n$ responses to a prompt; the paper follows the classical assumption in social choice theory that these preference rankings are independent and random

  • under this model, the paper proves that the probability of observing at least one Condorcet cycle is bounded below by $1-\exp(-\text{poly}(n))$ when the number of labelers $m\ge 3$, where $\text{poly}(n)$ denotes a positive polynomial in the number $n$ of responses

Statistical Possibility

given the impossibility of fully capturing human preferences using reward models, a natural alternative is to directly model the pairwise probability that one response is preferred over another

one promising alternative is Nash Learning from Human Feedback (NLHF), recently introduced by Google DeepMind

NLHF employs a pairwise preference modeling strategy by using a neural network to estimate the probability of one response being preferred over another

the alignment process is then carried out through a two-player game, where two LLMs are trained to generate responses, each aiming to maximize the probability that its own response is preferred over the other’s.

Any optimal solution of NLHF must be a Nash equilibrium of this two-player game.
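To make the notions of two-player game and Nash equilibrium concrete, here is a minimal sketch (not the paper's algorithm, and ignoring the KL regularization used in practice) that computes an equilibrium mixed strategy over a finite response set with `scipy.optimize.linprog`, assuming a known pairwise preference matrix `P` with $P_{ij} = P(y_i \succ y_j)$ and $P_{ij} + P_{ji} = 1$; the example matrix is the cyclic Condorcet-paradox preference recalled in Section 2 below:

```python
import numpy as np
from scipy.optimize import linprog

def nash_mixed_strategy(P):
    """Maximin strategy of the symmetric constant-sum preference game.

    P[i, j] = P(y_i preferred over y_j), with P + P.T == 1 (diagonal = 1/2).
    Returns a distribution pi maximizing min_j sum_i pi[i] * P[i, j].
    """
    n = P.shape[0]
    # Variables x = (pi_1, ..., pi_n, t); maximize t  <=>  minimize -t.
    c = np.concatenate([np.zeros(n), [-1.0]])
    # Constraints: t - sum_i pi_i * P[i, j] <= 0 for every opponent response j.
    A_ub = np.hstack([-P.T, np.ones((n, 1))])
    b_ub = np.zeros(n)
    # Probabilities sum to one.
    A_eq = np.concatenate([np.ones(n), [0.0]])[None, :]
    b_eq = [1.0]
    bounds = [(0, None)] * n + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:n]

# Cyclic preference: each response beats the next with probability 2/3.
P = np.array([[0.5, 2/3, 1/3],
              [1/3, 0.5, 2/3],
              [2/3, 1/3, 0.5]])
print(nash_mixed_strategy(P))  # ~[1/3, 1/3, 1/3], a genuinely mixed strategy
```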

does an NLHF-aligned LLM diversify its outputs, i.e., preserve minority preferences, rather than “collapse” to a single response for a prompt? this is important because generating only a single response (even if widely preferred) can suppress minority preferences, leading to biases and fairness concerns, particularly when prompts solicit opinions rather than objective facts

  • empirically, RLHF-aligned LLMs often exhibit substantial bias toward certain preferred responses
    • RLHF, unless regularized, can induce “preference collapse” due to its reward-based nature, where the LLM tends to generate the response with the highest reward with a probability higher than expected, even when the reward model captures human preferences.

the paper derives a necessary and sufficient condition under which any optimal solution of NLHF is a mixed strategy, that is, the NLHF-aligned LLM generates at least two distinct responses with positive probability, thereby avoiding collapse to a single response.

prove that a mixed strategy emerges iff there is no Condorcet winning response, which is defined as a response that a majority of labelers prefer over every other response, often referred to as the Condorcet winner in the social choice theory literature

the necessity of this condition is intuitive: if a Condorcet winning response exists, then a best response in NLHF trivially selects that single response.

under the same probabilistic ranking model above, they show that the probability that a Condorcet winning response exists is small, thereby illustrating the provable statistical possibility of NLHF in preserving minority preferences
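Checking the "no Condorcet winning response" condition on a finite preference matrix is a one-line scan; a minimal sketch using the same hypothetical `P` as above:

```python
import numpy as np

def condorcet_winner(P):
    """Index of a response preferred by a majority over every other, or None.

    P[i, j] = P(y_i preferred over y_j); a winner is a row i with
    P[i, j] > 1/2 for all j != i.
    """
    n = P.shape[0]
    for i in range(n):
        if all(P[i, j] > 0.5 for j in range(n) if j != i):
            return i
    return None

P = np.array([[0.5, 2/3, 1/3],
              [1/3, 0.5, 2/3],
              [2/3, 1/3, 0.5]])
print(condorcet_winner(P))  # None: no winner, so any optimal NLHF solution is mixed
```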

Summary of Contributions

  • whether human preferences can be captured by reward models depends on the presence of Condorcet cycles.
  • in scenarios where Condorcet cycles exist (with high probability), an NLHF-aligned LLM would output a mixed strategy if no Condorcet winning response is present, and would otherwise output the winning response
  • in contrast, an LLM aligned using standard RLHF with a reward model, if fully optimized in fine-tuning, is prone to collapsing to a single response, irrespective of whether a Condorcet winning response exists or not
  • since the Condorcet winning response does not exist with high probability under the probabilistic ranking model, the practical implication is that RLHF-aligned LLMs tend to collapse, whereas NLHF is more likely to generate diverse responses
  • while NLHF can theoretically capture any preferences, it faces computational challenges due to the complexity of solving a two-player game with general pairwise preferences
  • while reward models cannot fully capture human preferences, reward functions that take the policy distribution of the LLM as input can effectively represent any preferences
  • establish that the two-player game in NLHF is equivalent to a maximization problem involving a policy reward, and employ a rejection sampling technique to obtain an unbiased estimator of the policy reward gradient
  • the resulting algorithm, called Nash Rejection Sampling, uses a single-loop approach to effectively find the Nash equilibrium of NLHF

  • there are concerns about LLMs regarding fairness and potential biases: one contributing factor to these biases arises from RLHF
    • conventional RLHF methods often converge to a single preference type, a phenomenon known as preference collapse
  • several preference fine-tuning methods have been proposed for RLHF
    • proximal policy optimization (PPO) does not fully leverage the advantages of RLHF in aligning LLMs with human preferences
  • a popular alternative to reward-based RLHF is direct preference optimization (DPO)
    • DPO fine-tunes LLMs directly on human preference data, eliminating the need to train a reward model, which makes it computationally more efficient
    • recent studies suggest that DPO is less effective than reward-based RLHF for aligning LLMs
  • in addition to NLHF, recent research has explored the formulation of two-player constant-sum games for aligning human preferences
  • rejection sampling schemes have been used heuristically in many RLHF techniques
  • in social choice theory, the probabilities of the existence of Condorcet cycles and Condorcet winning responses have been studied for decades under various assumptions
    • these studies primarily focus on electoral settings, where the number of candidates (responses, $n$) is relatively small, while the number of voters (labelers, $m$) is large and is often assumed to tend to infinity in analysis.
    • analysis in the case of a small or finite $m$, which is the focus here, is significantly more challenging.
  • Section 2: demonstrate the limitation of reward models for representing human preferences
  • Section 3: investigate under what conditions NLHF would preserve minority preferences
  • Section 4: discuss the policy reward of NLHF and propose a new algorithm to solve it

Section 2: Impossibility of Reward Models for Preference Representation

analyze the statistical limits of using reward models to capture human preferences

  • human preferences can be modeled by a reward model iff there does not exist a Condorcet cycle in the preference structure
  • under mild assumptions, the probability of a Condorcet cycle emerging approaches one exponentially as the number of responses increases

2.1 Preliminaries of RLHF

  • $\pi(y\mid x)$: the LLM's distribution over responses $y$ given a prompt $x$
  • $\pi_{\text{ref}}$: reference LLM
  • $\rho$: a distribution over prompts
    • a fixed database of prompts, or adaptively updated based on prior responses

in RLHF, human preferences are typically modeled using the BTL model, which assumes the existence of a reward model $r(x, y)$ such that:

\[P(y\succ y'\mid x) = \frac{\exp(r(x, y))}{\exp(r(x, y)) + \exp(r(x, y'))}\]
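For concreteness, a minimal sketch of the BTL pairwise probability, written as a sigmoid of the reward gap for numerical stability (the reward values are made-up placeholders):

```python
import math

def btl_prob(r_y: float, r_yprime: float) -> float:
    """P(y > y' | x) under the BTL model: a sigmoid of the reward difference."""
    return 1.0 / (1.0 + math.exp(-(r_y - r_yprime)))

print(btl_prob(1.2, 0.4))  # ~0.69; the rewards are arbitrary illustrative numbers
```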

the standard RLHF objective is

\[\max_\pi \bbE_{x\sim \rho}[\bbE_{y\sim \pi(\cdot\mid x)}r(x, y) - \tau \text{KL}(\pi(\cdot\mid x)\Vert \pi_{\text{ref}}(\cdot\mid x)) ]\]
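As a side note (a standard fact about this KL-regularized objective, not spelled out in the note above): for a fixed reward and $\tau > 0$, the per-prompt maximizer is the Gibbs tilting of the reference model,

\[\pi^*(y\mid x) = \frac{\pi_{\text{ref}}(y\mid x)\exp(r(x, y)/\tau)}{\sum_{y'}\pi_{\text{ref}}(y'\mid x)\exp(r(x, y')/\tau)},\]

so as $\tau\rightarrow 0$ the aligned policy concentrates on the highest-reward response, which is the reward-based route to the preference collapse mentioned earlier.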

2.2 Reward Model and Condorcet Cycle

Given a prompt $x$ and its associated $n$ responses $y_1,\ldots, y_n$, an individual (labeler) expresses a preference in the form of a ranking of these $n$ responses. {.thm title="Assumption 2.1 (Linear Preference Ranking)"}

assume that each individual has a rational preference, meaning they maintain a strict ranking over the $n$ responses.

for any distinct responses $y$ and $y'$, one says that $y$ is preferred over $y'$ if $P(y\succ y') > 1/2$.

Given Assumption 2.1, each individual independently samples a linear preference ranking with equal probability $1/n!$ {.thm title="Assumption 2.2 (Impartial Culture Condition)"}

at its core, a reward model should reflect how labelers prefer certain responses, meaning that whenever the preference model satisfies $P(y\succ y') > 1/2$, the reward must satisfy $r(y) > r(y')$

A reward model $r: \cY \rightarrow \IR$ captures a preference $P(y\succ y')$ if for any two distinct responses $y$ and $y'$, $r(y) > r(y')$ implies $P(y\succ y') > 1/2$

from a decision-theoretic perspective, the definition establishes an equivalence between two decision-making paradigms

  • one induced by the utility function (reward model)
  • another defined by the preference relation
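This definition translates directly into a finite sanity check; a minimal sketch with a hypothetical reward vector and preference matrix:

```python
import numpy as np

def captures(r, P):
    """Check the definition: whenever r[i] > r[j], we need P[i, j] > 1/2."""
    n = len(r)
    return all(P[i, j] > 0.5
               for i in range(n) for j in range(n)
               if i != j and r[i] > r[j])

r = np.array([2.0, 1.0, 0.0])        # hypothetical rewards for three responses
P = np.array([[0.5, 0.7, 0.6],
              [0.3, 0.5, 0.8],
              [0.4, 0.2, 0.5]])      # hypothetical preference matrix (P + P.T = 1)
print(captures(r, P))  # True: every reward gap agrees with a majority preference
```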

given a reward model, for any three distinct responses, $y_{i_1}, y_{i_2}, y_{i_3}$,

  • if $r(y_{i_1}) > r(y_{i_2})$ and $r(y_{i_2}) > r(y_{i_3})$, then $r(y_{i_1}) > r(y_{i_3})$ by transitivity

therefore, any preference model induced by a reward model must be acyclic (or transitive):

if $P(y_{i_1} \succ y_{i_2}) > \frac 12$ and $P(y_{i_2} \succ y_{i_3}) > \frac 12$, then $P(y_{i_1} \succ y_{i_3}) > \frac 12$ must hold

many real-world preferences are cyclic, as demonstrated by the Condorcet paradox

Suppose one-third of the population prefers $y_1\succ y_2\succ y_3$, one-third prefers $y_2\succ y_3\succ y_1$, and the remaining third prefers $y_3\succ y_1\succ y_2$. In this case we have \(P(y_1\succ y_2) = P(y_2\succ y_3) = P(y_3\succ y_1) = \frac 23\) therefore this preference relationship is cyclic

the Condorcet paradox demonstrates that cyclic preferences can arise from simple scenarios: just three labelers with different preferences can produce a cyclic group preference

A sequence of responses $y_{i_1}, \ldots, y_{i_r}$ forms a Condorcet $r$-cycle if $P(y_{i_p} \succ y_{i_{p+1}} ) > 1/2$ for all $p=1,\ldots, r$, where $y_{i_{r+1}} = y_{i_1}$ by convention

For any set of responses $\{y_1,\ldots, y_n\}$ with $n\ge 3$ and any preference $P(y\succ y')$ defined on this set, there exists a reward model $r(y)$ that captures the preference $P(y\succ y')$ iff there is no Condorcet cycle in the set of responses.
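The "iff" here can be made concrete for a finite response set with no ties: majority preferences then form a tournament, the tournament is free of Condorcet cycles exactly when its Copeland scores (number of responses each response beats) are $0, 1, \ldots, n-1$, and in that case the score itself is a reward that captures the preference. A minimal sketch under these assumptions:

```python
import numpy as np

def reward_or_cycle(P):
    """Return a reward vector capturing P, or None if a Condorcet cycle exists.

    P[i, j] = P(y_i preferred over y_j); assumes no ties (P[i, j] != 1/2, i != j),
    so majority preferences form a tournament on the n responses.
    """
    n = P.shape[0]
    wins = (P > 0.5).sum(axis=1).tolist()  # Copeland score of each response
    # A tournament is transitive (cycle-free) iff its score sequence is 0, 1, ..., n-1.
    if sorted(wins) != list(range(n)):
        return None  # a Condorcet cycle exists, so no reward model can capture P
    return np.array(wins, dtype=float)  # the score is itself a valid reward

# The Condorcet-paradox preferences from the example above (each majority is 2/3):
P_cycle = np.array([[0.5, 2/3, 1/3],
                    [1/3, 0.5, 2/3],
                    [2/3, 1/3, 0.5]])
print(reward_or_cycle(P_cycle))  # None: the cycle rules out any reward model
```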

2.3 Probability of Condorcet Cycle

Let $\alpha_1$ represent the proportion of labelers who prefer $y_1\succ y_2\succ y_3$, $\alpha_2$ the proportion who prefer $y_2\succ y_3\succ y_1$, $\alpha_3$ the proportion who prefer $y_3\succ y_2\succ y_1$, …, where $\alpha_i\ge 0$ for $i=1,\ldots, 6$, and $\alpha_1 + \ldots + \alpha_6 = 1$. Then, the overall preference relationship is cyclic iff certain linear inequalities in the proportions $\alpha_1, \ldots, \alpha_6$ hold.

Suppose the population satisfies Assumption 2.2. Let the number of responses $n\ge 3$ and the number of labelers $m\ge 3$. Then \(\bP_{m, n}(\text{Condorcet cycle}) \ge 1 - c_1e^{-c_2n},\) where $c_1, c_2 > 0$ are universal constants.

As $n\rightarrow \infty$, the probability that a Condorcet cycle exists approaches one exponentially fast, being at least $1-e^{-\text{poly}(n)}$
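This can be eyeballed with a quick Monte Carlo sketch under Assumption 2.2 (impartial culture): sample $m$ uniform rankings of $n$ responses, aggregate them into majority preferences, and reuse the tournament-score test from the sketch above (an odd $m$ avoids ties):

```python
import numpy as np

rng = np.random.default_rng(0)

def condorcet_cycle_prob(m, n, trials=2000):
    """Estimate P(at least one Condorcet cycle) with m labelers and n responses."""
    count = 0
    for _ in range(trials):
        # ranks[k, i] = rank labeler k gives response i (each row a uniform permutation).
        ranks = rng.permuted(np.tile(np.arange(n), (m, 1)), axis=1)
        # P[i, j] = fraction of labelers ranking response i above response j.
        P = (ranks[:, :, None] < ranks[:, None, :]).mean(axis=0)
        wins = (P > 0.5).sum(axis=1).tolist()
        if sorted(wins) != list(range(n)):  # same acyclicity test as above
            count += 1
    return count / trials

for n in (3, 5, 10, 20):
    print(n, condorcet_cycle_prob(m=3, n=n))  # the estimate climbs toward 1 as n grows
```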

As a result, human preferences cannot be accurately captured by a reward model with high probability

this demonstrates the necessity of using preference models directly, even under the idealized assumption of rational labelers. Moreover, since people typically do not exhibit perfect rationality in practice, the case for direct preference modeling becomes even stronger

Section 3: Possibility of Preserving Diverse Preferences

NLHF: uses pairwise preferences and is thus non-reward-based

in particular, NLHF can represent any possible human preferences. We are interested in whether this enhanced preference representation would bring benefits in preserving diverse preferences.

examine whether NLHF can avoid preference collapse, a phenomenon observed in RLHF

