Debiasing Watermarks via Maximal Coupling
it presents a green/red list watermarking approach that partitions the token set into green and red lists, subtly increasing the generation probability for green tokens
- to correct token distribution bias, the method employs maximal coupling, using a uniform coin flip to decide whether to apply bias correction, with the result embedded as a pseudorandom watermark signal
Introduction
the earliest watermarking scheme is the Gumbel-max watermark. This method selects the next token $w_t$ using the decoder
\[w_t = S(P_t, \xi_t) = \argmax_{w\in \cW} \frac{\log U_w}{P_{t,w}},\]assuming $\xi_t = (U_1,\ldots, U_{\vert \cW\vert})$ is a random vector consisting of $\vert \cW\vert$ iid copies of $U[0, 1]$, serving as part of the watermark key.
while this watermarked decoder theoretically preserves the original next-token distribution when $\xi_t$ is truly random, and achieves high detection efficiency, in practice it can generate repetitive text and significantly degrade text quality.
the main reason is that, in existing implementations, the pseudorandom variable $\xi_t$ depends deterministically on the prior $k$ tokens. As a result, conditioned on the previous tokens, the decoder deterministically outputs the next token, which can lead to sub-optimal text generation.
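a minimal sketch of this decoder, assuming the common hash-the-last-$k$-tokens implementation of the pseudorandomness (the hashing scheme and names here are illustrative, not the exact construction of any particular paper):

```python
import hashlib

import numpy as np


def gumbel_max_decode(p_t: np.ndarray, prev_tokens: list, key: bytes, k: int = 4) -> int:
    """Gumbel-max watermarked decoder: w_t = argmax_w (log U_w) / P_{t,w}."""
    # Derive xi_t = (U_1, ..., U_|W|) pseudorandomly from the key and the last k tokens,
    # so the detector can recompute the same vector.  Because this mapping is
    # deterministic, identical contexts always produce identical next tokens.
    seed = hashlib.sha256(key + str(prev_tokens[-k:]).encode()).digest()
    rng = np.random.default_rng(int.from_bytes(seed[:8], "little"))
    u = rng.uniform(size=p_t.shape[0])                           # iid U[0,1], one per vocabulary entry
    return int(np.argmax(np.log(u) / np.maximum(p_t, 1e-12)))    # same argmax as U_w^{1/P_{t,w}}
```

calling this twice with the same key and the same preceding tokens returns the same token, which is exactly the source of the repetition problem above.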
another scheme, the green/red list watermark, employs a stochastic decoder.
it first partitions the token vocabulary $\cW$ into the green list $\cG$ and the red list $\cR$ (pseudo)randomly, and shares the green list $\cG$ with the detector as part of the watermark key.
the watermark is embedded by promoting the use of green tokens during decoding: given the NTP distribution $P_t$ and the green list $\cG$, the watermarked decoder constructs an alternative distribution $Q_t$ and samples the next token from $Q_t$ (see the sketch after the bullets below).
the simplest version is the hard watermark, where sampling is restricted to the green list $\cG$. Green tokens are then more likely under $Q_t$ than under $P_t$, so one can detect this difference by counting the number of green tokens in a given text.
- however, limiting sampling to a vocabulary subset distorts the NTP distribution $P_t$ and can degrade LLM generation significantly.
- despite these potential drawbacks, this watermarked decoder remains stochastic, as additional randomness is introduced during sampling from $Q_t$, while pseudorandomness is only used to set the green list. This extra randomness in the sampling process helps avoid repetitive generation.
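for concreteness, a small sketch of how $Q_t$ can be built from $P_t$ and a green-list mask; the hard version simply zeroes out red tokens, and the soft variant shown alongside it (a constant logit boost $\delta$ for green tokens) is the typical implementation in the literature, included here as an assumption rather than this paper's specific choice:

```python
import numpy as np


def green_red_q(p_t: np.ndarray, green_mask: np.ndarray, delta: float = 2.0, hard: bool = False) -> np.ndarray:
    """Construct the biased distribution Q_t from P_t and a boolean green-list mask."""
    if hard:
        q = np.where(green_mask, p_t, 0.0)                            # hard watermark: only green tokens survive
        return q / q.sum()
    logits = np.log(np.maximum(p_t, 1e-12)) + delta * green_mask      # soft variant: boost green logits by delta
    q = np.exp(logits - logits.max())
    return q / q.sum()


# sampling from Q_t keeps the decoder stochastic:
# rng = np.random.default_rng(); w_t = rng.choice(len(q), p=q)
```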
each of the two watermarking schemes has its own strengths
- on one hand, like the Gumbel-max watermark, we want the decoder $S$ to faithfully follow the original NTP distributions, a property known as unbiasedness
- on the other hand, similar to the green/red list watermark, we seek a stochastic decoder $S$ to alleviate the negative impact of the pseudorandom variables on text generation
can we achieve the best of both worlds?
the paper introduces a novel watermarking scheme that leverages maximal coupling to balance these goals
it reminds me of cross-validation on graphons
the process begins by drawing a standard uniform random variable $\zeta$ and computing the overlap probability $p = \sum_{w\in\cW} \min(P_{t,w}, Q_{t, w})$ (sketched in code after the two cases below).
- if $\zeta < p$, sample from the overlapping region
- otherwise, sample from the excess distribution
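a minimal sketch of this coupling step; I take the excess distribution from $P_t$ (i.e., $P_t - \min(P_t, Q_t)$ renormalized), since that is the choice that leaves the output marginal exactly $P_t$ and matches the unbiasedness goal above — treat this as an illustration rather than the paper's exact pseudocode:

```python
import numpy as np


def maximal_coupling_decode(p_t: np.ndarray, q_t: np.ndarray, zeta: float,
                            rng: np.random.Generator) -> int:
    """Sample the next token via maximal coupling of P_t and the green-boosted Q_t."""
    overlap = np.minimum(p_t, q_t)
    p = overlap.sum()                        # p = sum_w min(P_{t,w}, Q_{t,w})
    if zeta < p:
        dist = overlap / p                   # overlapping region, shared by P_t and Q_t
    else:
        excess = p_t - overlap               # excess of P_t (supported where P_t > Q_t)
        dist = excess / excess.sum()
    # Marginally: Pr(w) = p * (overlap_w / p) + (1 - p) * (excess_w / (1 - p)) = P_{t,w}.
    return int(rng.choice(len(dist), p=dist))
```

the value $\zeta_t$, together with the green list $\cG$, is exactly the side information $\xi_t$ shared with the detector below.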
to aid detection, we share the information $\xi_t = (\cG, \zeta_t)$, the green list $\cG$ and the random variable $\zeta_t$, with the detector.
- $H_0$: the text $w_{1:n}$ is written by a human; that is, $\zeta_t\mid (P_t, \cG)\sim_{iid} U[0, 1]$ for $t=1,\ldots, n$
- $H_1$: the text $w_{1:n}$ is written by a language model; that is, $\zeta_t\mid (P_t, \cG)\sim U[0, P_{t,\cG}]$ for $t=1,\ldots, n$
- $H_1^{(mix)}$: the text $w_{1:n}$ is first generated by a language model and then modified; that is $\zeta_t\mid (P_t,\cG)\sim (1-\varepsilon_n)U[0, 1] + \varepsilon_n U[0, P_{t, \cG}]$ for $t = 1,\ldots, n$
to facilitate theoretical analysis, they focus on the sparse watermark signal case, where $\varepsilon_n\asymp n^{-p}$ and $1 - P_{t,\cG} \asymp n^{-q}$ decay with $n$ at possibly different rates
- Without sufficient watermark signals, that is, $2p + q > 1$, $H_0$ and $H_1^{(mix)}$ merge asymptotically. Hence, no test can reliably separate $H_0$ and $H_1^{(mix)}$
- Given sufficient watermark signals, that is, $2p + q < 1$, $H_0$ and $H_1^{(mix)}$ separate asymptotically. In this case, the alternative hypothesis can be reliably detected using the likelihood ratio test and higher criticism.
- with slightly stronger watermark signals, that is $p + q < 0.5$, the sum test, which rejects $H_0$ if the sum of the $\zeta_t$’s is too large, can reliably separate $H_0$ and $H_1^{(mix)}$ asymptotically.
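as a back-of-the-envelope sanity check on the sum-test condition (my own heuristic, not a derivation from the paper): under $H_1^{(mix)}$, each $\zeta_t$ has
\[
\Bigl\vert \mathbb{E}_{H_1^{(mix)}}[\zeta_t] - \tfrac12 \Bigr\vert = \frac{\varepsilon_n\,(1 - P_{t,\cG})}{2} \asymp n^{-(p+q)},
\qquad
\frac{n\cdot n^{-(p+q)}}{\sqrt{n/12}} \asymp n^{\frac12 - (p+q)},
\]so the standardized shift of $\sum_t \zeta_t$ diverges exactly when $p + q < 1/2$ and vanishes when $p + q > 1/2$.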
another application of maximal coupling is speculative decoding, where a smaller draft model generates speculative token predictions that are then accepted or corrected by a more powerful target model. The paper applies this setup to the watermarking scheme and demonstrates that the method effectively retains the watermark signal despite targeted attempts to modify the text.
2. Preliminaries
maximal coupling is a powerful tool in probability theory that constructs a coupling between two random variables to maximize the probability that they are equal
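concretely, for two distributions $P$ and $Q$ on $\cW$, any coupling $(X, Y)$ with $X\sim P$ and $Y\sim Q$ satisfies $\Pr(X = Y)\le 1 - \mathrm{TV}(P, Q)$, and a maximal coupling attains this bound:
\[
\Pr(X = Y) = \sum_{w\in\cW}\min(P_w, Q_w) = 1 - \mathrm{TV}(P, Q),
\]which is exactly the overlap probability $p = \sum_{w}\min(P_{t,w}, Q_{t,w})$ used in the decoding step.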
3. Watermarking Scheme via Maximal Coupling
a watermarking scheme has two parts: decoding and detection
3.1 Decoding
3.2 Detection
- Detection as Hypothesis Testing
- Sum Test
- Higher Criticism
- Practical Considerations: the probability of sampling a token from the green list can be arbitrarily close to 1, or even equal to 1, in which case $U[0, P_{t,\cG}]$ is (nearly) indistinguishable from $U[0, 1]$ and the token carries (almost) no watermark signal
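a minimal sketch of the two detection statistics, assuming the per-token evidence has already been reduced to the $\zeta_t$'s (for the sum test) and to p-values that are $U[0,1]$ under $H_0$ (for higher criticism); how exactly the paper pivots $\zeta_t$ and the observed token into a p-value, and the one-sided direction it uses, are left out here, and the HC form below is the standard Donoho-Jin statistic rather than necessarily the paper's exact variant:

```python
import numpy as np


def sum_test_stat(zetas: np.ndarray) -> float:
    """Standardized sum of the zeta_t's; under H_0 they are iid U[0,1]."""
    n = len(zetas)
    return float((zetas.sum() - n / 2) / np.sqrt(n / 12))


def higher_criticism_stat(pvals: np.ndarray, alpha0: float = 0.5) -> float:
    """Donoho-Jin higher criticism over the smallest alpha0-fraction of sorted p-values."""
    n = len(pvals)
    p_sorted = np.clip(np.sort(pvals), 1e-10, 1 - 1e-10)    # avoid division by zero at the boundary
    i = np.arange(1, n + 1)
    hc = np.sqrt(n) * (i / n - p_sorted) / np.sqrt(p_sorted * (1 - p_sorted))
    return float(hc[: max(1, int(alpha0 * n))].max())
```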
3.3 Watermark Key
in cases where the watermarked text has a predetermined maximum length $n$, the language model provider can pre-generate both the green lists $\cG_{1:n}$ and uniform random variables $\zeta_{1:n}$, sharing them with the detector as the watermark key
for varying lengths, the language model provider supplies two distinct pseudorandom functions as the watermark key
- the first generates the green lists
- the second generates the uniform random variables
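a sketch of one way to realize these two pseudorandom functions with a keyed hash of the preceding $k$ tokens; this mirrors common practice for pseudorandom watermark keys and is an assumption, not the paper's stated construction:

```python
import hashlib

import numpy as np


def _seed(key: bytes, prev_tokens: list, k: int, tag: bytes) -> int:
    digest = hashlib.sha256(key + tag + str(prev_tokens[-k:]).encode()).digest()
    return int.from_bytes(digest[:8], "little")


def green_list(key: bytes, prev_tokens: list, vocab_size: int, gamma: float = 0.5, k: int = 4) -> np.ndarray:
    """First pseudorandom function: a green-list mask covering a fraction gamma of the vocabulary."""
    rng = np.random.default_rng(_seed(key, prev_tokens, k, b"green"))
    mask = np.zeros(vocab_size, dtype=bool)
    mask[rng.permutation(vocab_size)[: int(gamma * vocab_size)]] = True
    return mask


def zeta(key: bytes, prev_tokens: list, k: int = 4) -> float:
    """Second pseudorandom function: the uniform random variable zeta_t."""
    rng = np.random.default_rng(_seed(key, prev_tokens, k, b"zeta"))
    return float(rng.uniform())
```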
3.4 Interpret Speculative Decoding as Post-processing
4. Simulation Studies
two distinct regimes
- the first regime covers situations where each green token carrying the watermark signal exhibits sufficient signal strength, requiring only $P_{t, \cG_t} \ge m^{-r}$
- the second regime deals with a weak signal condition, that is, $P_{t, \cG_t} = 1-m^{-q}$
compute each test statistic under both the null and alternative hypotheses 2000 times and use the quantile of the simulated histogram of the test statistic under the null hypothesis to determine the critical value
The first regime: strong signal
- fix $r = 0.2$
- test different values of $p\in {0.25, 0.5, 0.75}$
generate random variables $\zeta_t$ under the alternative hypothesis,
- first sample $P_{t, \cG_t}$ uniformly from $[m^{-r}, 1]$
- then sample $\zeta_t$ from $U[0, P_{t,\cG_t}]$
- when $p$ is small, a large fraction of the $\zeta_t$’s carry the watermark signal, and the power of both tests converges to 1 as $m$ increases
- as $p$ becomes larger, the power of both tests decreases
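a minimal sketch of this Monte Carlo protocol for the strong-signal regime and the sum test; I use a two-sided rejection region calibrated on the simulated null, which keeps the sketch agnostic about the sign convention of the one-sided test in the paper (parameter names follow the notation above):

```python
import numpy as np


def sum_test_power(m: int, p: float, r: float, reps: int = 2000, alpha: float = 0.05, seed: int = 0) -> float:
    """Monte Carlo power of a (two-sided) sum test in the strong-signal regime."""
    rng = np.random.default_rng(seed)
    eps = m ** (-p)                                          # fraction of tokens carrying the signal

    def stat(z):                                             # |standardized sum| for each replication
        return np.abs(z.sum(axis=1) - m / 2) / np.sqrt(m / 12)

    null = rng.uniform(size=(reps, m))                       # H_0: zeta_t iid U[0,1]
    # alternative: with prob. eps, zeta_t ~ U[0, P] where P ~ U[m^{-r}, 1]
    carries = rng.uniform(size=(reps, m)) < eps
    P = rng.uniform(m ** (-r), 1.0, size=(reps, m))
    alt = np.where(carries, rng.uniform(size=(reps, m)) * P, rng.uniform(size=(reps, m)))

    crit = np.quantile(stat(null), 1 - alpha)                # critical value from the simulated null histogram
    return float((stat(alt) > crit).mean())


# e.g. sum_test_power(m=10_000, p=0.25, r=0.2); for small p the power grows toward 1 as m increases
```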
The second regime: weak signal
set $P_{t, \cG_t} = 1-m^{-q}$ to maintain a consistently weak signal strength
- Scenario 1: Fix $p$ and vary $q$. Set the parameter $p$ at 0.2 and vary the parameter $q\in {0.2, 0.3, 0.4, 0.5, 0.6}$
- Scenario 2: Fix $q$ and vary $p$. Set $q$ at 0.4, and adjust $p\in {0.1, 0.2, 0.3}$
- Scenario 3: On the boundary $p+q=1/2$
- Scenario 4: On the boundary $2p+q = 1$
there is a consistent pattern across all scenarios: the sum test tends to outperform $HC$ when $m$ is relatively small
5. Language Model Experiments
prompt Microsoft’s instruction fine-tuned Phi-3 models on two question-answering datasets, namely ELI5 and FinQA
for all experiments, embed watermarks into the Phi-3-mini-4k-instruct model (3.8B parameters)
- to evaluate the text distortion induced by the watermark, use the S-BERT similarity score to compare the semantic similarity between the generated text of the same model with and without watermarking (see the sketch after this list)
- the higher the similarity score, the less the distortion
- report the true positive rate (TPR) before and after attacks, including a purely substitution-based attack (TPR aug.) and a paraphrasing attack (TPR para.)
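a minimal sketch of computing the S-BERT similarity with the sentence-transformers package; the checkpoint name below is just a common default and may not be the one used in the paper:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")    # any S-BERT checkpoint; the paper's choice may differ


def sbert_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between S-BERT embeddings of two generations (higher = less distortion)."""
    emb = model.encode([text_a, text_b], convert_to_tensor=True)
    return float(util.cos_sim(emb[0], emb[1]))
```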
in addition, investigate the speculative decoding setup to assess the robustness of different watermarks, where the goal is to modify watermarked texts so as to improve their quality