Debiasing Watermarks via Maximal Coupling
it presents a green/red list watermarking approach that partitions the token set into green and red lists, subtly increasing the generation probability for green tokens
- to correct token distribution bias, the method employs maximal coupling, using a uniform coin flip to decide whether to apply bias correction, with the result embedded as a pseudorandom watermark signal
Introduction
the earliest watermarking scheme is the Gumbel-max watermark. This method selects the next token $w_t$ using the decoder
\[w_t = S(P_t, \xi_t) = \argmax_{w\in \cW} \frac{\log U_w}{P_{t,w}},\]assuming $\xi_t = (U_1,\ldots, U_{\vert \cW\vert})$ is a random vector consisting of $\vert \cW\vert$ iid copies of $U[0, 1]$, serving as part of the watermark key.
while this watermarked decoder theoretically preserves the original next-token distribution when $\xi_t$ is truly random, and achieves high detection efficiency, in practice it can generate repetitive text and significantly degrade text quality.
the main reason is that, in existing implementations, the pseudorandom variable $\xi_t$ depends deterministically on the prior $k$ tokens. As a result, conditioned on the previous tokens, the decoder deterministically outputs the next token, which can lead to sub-optimal text generation.
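a minimal sketch of this decoder, assuming the common hash-the-last-$k$-tokens implementation of the pseudorandomness (the hashing scheme and names here are illustrative, not the exact construction of any particular paper):

```python
import hashlib

import numpy as np


def gumbel_max_decode(p_t: np.ndarray, prev_tokens: list, key: bytes, k: int = 4) -> int:
    """Gumbel-max watermarked decoder: w_t = argmax_w (log U_w) / P_{t,w}."""
    # Derive xi_t = (U_1, ..., U_|W|) pseudorandomly from the key and the last k tokens,
    # so the detector can recompute the same vector.  Because this mapping is
    # deterministic, identical contexts always produce identical next tokens.
    seed = hashlib.sha256(key + str(prev_tokens[-k:]).encode()).digest()
    rng = np.random.default_rng(int.from_bytes(seed[:8], "little"))
    u = rng.uniform(size=p_t.shape[0])                           # iid U[0,1], one per vocabulary entry
    return int(np.argmax(np.log(u) / np.maximum(p_t, 1e-12)))    # same argmax as U_w^{1/P_{t,w}}
```

calling this twice with the same key and the same preceding tokens returns the same token, which is exactly the source of the repetition problem above.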
another scheme, the green/red list watermark, employs a stochastic decoder.
it first partitions the token vocabulary $\cW$ into the green list $\cG$ and the red list $\cR$ (pseudo)randomly, and shares the green list $\cG$ with the detector as part of the watermark key.
the watermark is embedded by promoting the use of green tokens during decoding: given the NTP distribution $P_t$ and the green list $\cG$, the watermarked decoder constructs an alternative distribution $Q_t$ and samples the next token from $Q_t$ (see the sketch after the bullets below).
the simplest version is the hard watermark, where sampling is restricted to the green list $\cG$. Green tokens are then more likely under $Q_t$ than under $P_t$, so one can detect this difference by counting the number of green tokens in a given text.
- however, limiting sampling to a vocabulary subset distorts the NTP distribution $P_t$ and can degrade LLM generation significantly.
- despite these potential drawbacks, this watermarked decoder remains stochastic, as additional randomness is introduced during sampling from $Q_t$, while pseudorandomness is only used to set the green list. This extra randomness in the sampling process helps avoid repetitive generation.
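for concreteness, a small sketch of how $Q_t$ can be built from $P_t$ and a green-list mask; the hard version simply zeroes out red tokens, and the soft variant shown alongside it (a constant logit boost $\delta$ for green tokens) is the typical implementation in the literature, included here as an assumption rather than this paper's specific choice:

```python
import numpy as np


def green_red_q(p_t: np.ndarray, green_mask: np.ndarray, delta: float = 2.0, hard: bool = False) -> np.ndarray:
    """Construct the biased distribution Q_t from P_t and a boolean green-list mask."""
    if hard:
        q = np.where(green_mask, p_t, 0.0)                            # hard watermark: only green tokens survive
        return q / q.sum()
    logits = np.log(np.maximum(p_t, 1e-12)) + delta * green_mask      # soft variant: boost green logits by delta
    q = np.exp(logits - logits.max())
    return q / q.sum()


# sampling from Q_t keeps the decoder stochastic:
# rng = np.random.default_rng(); w_t = rng.choice(len(q), p=q)
```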
each of the two watermarking schemes has its own strengths
- on one hand, like the Gumbel-max watermark, we want the decoder $S$ to faithfully follow the original NTP distributions, a property known as unbiasedness
- on the other hand, similar to the green/red list watermark, we seek a stochastic decoder $S$ to alleviate the negative impact of the pseudorandom variables on text generation
can we achieve the best of both worlds?
the paper introduces a novel watermarking scheme that leverages maximal coupling to balance these goals
it reminds me of cross-validation on graphons
the process begins by drawing a standard uniform random variable $\zeta$ and computing the overlap probability $p = \sum_{w\in\cW} \min(P_{t,w}, Q_{t, w})$ (sketched in code after the two cases below).
- if $\zeta < p$, sample from the overlapping region
- otherwise, sample from the excess distribution
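a minimal sketch of this coupling step; I take the excess distribution from $P_t$ (i.e., $P_t - \min(P_t, Q_t)$ renormalized), since that is the choice that leaves the output marginal exactly $P_t$ and matches the unbiasedness goal above — treat this as an illustration rather than the paper's exact pseudocode:

```python
import numpy as np


def maximal_coupling_decode(p_t: np.ndarray, q_t: np.ndarray, zeta: float,
                            rng: np.random.Generator) -> int:
    """Sample the next token via maximal coupling of P_t and the green-boosted Q_t."""
    overlap = np.minimum(p_t, q_t)
    p = overlap.sum()                        # p = sum_w min(P_{t,w}, Q_{t,w})
    if zeta < p:
        dist = overlap / p                   # overlapping region, shared by P_t and Q_t
    else:
        excess = p_t - overlap               # excess of P_t (supported where P_t > Q_t)
        dist = excess / excess.sum()
    # Marginally: Pr(w) = p * (overlap_w / p) + (1 - p) * (excess_w / (1 - p)) = P_{t,w}.
    return int(rng.choice(len(dist), p=dist))
```

the value $\zeta_t$, together with the green list $\cG$, is exactly the side information $\xi_t$ shared with the detector below.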
to aid detection, we share the information $\xi_t = (\cG, \zeta_t)$, the green list $\cG$ and the random variable $\zeta_t$, with the detector.
- $H_0$: the text $w_{1:n}$ is written by a human; that is, $\zeta_t\mid (P_t, \cG)\sim_{iid} U[0, 1]$ for $t=1,\ldots, n$
- $H_1$: the text $w_{1:n}$ is written by a language model; that is, $\zeta_t\mid (P_t, \cG)\sim U[0, P_{t,\cG}]$ for $t=1,\ldots, n$
- $H_1^{(mix)}$: the text $w_{1:n}$ is first generated by a language model and then modified; that is $\zeta_t\mid (P_t,\cG)\sim (1-\varepsilon_n)U[0, 1] + \varepsilon_n U[0, P_{t, \cG}]$ for $t = 1,\ldots, n$
to facilitate theoretical analysis, they focus on the sparse watermark signal case, where $\varepsilon_n\asymp n^{-p}$ and $1 - P_{t,\cG} \asymp n^{-q}$ decay with $n$ at possibly different rates
- Without sufficient watermark signals, that is, $2p + q > 1$, $H_0$ and $H_1^{(mix)}$ merge asymptotically. Hence, no test can reliably separate $H_0$ and $H_1^{(mix)}$
- Given sufficient watermark signals, that is, $2p + q < 1$, $H_0$ and $H_1^{(mix)}$ separate asymptotically. In this case, the alternative hypothesis can be reliably detected using the likelihood ratio test and higher criticism.
- with slightly stronger watermark signals, that is $p + q < 0.5$, the sum test, which rejects $H_0$ if the sum of the $\zeta_t$’s is too large, can reliably separate $H_0$ and $H_1^{(mix)}$ asymptotically.
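as a back-of-the-envelope sanity check on the sum-test condition (my own heuristic, not a derivation from the paper): under $H_1^{(mix)}$, each $\zeta_t$ has
\[
\Bigl\vert \mathbb{E}_{H_1^{(mix)}}[\zeta_t] - \tfrac12 \Bigr\vert = \frac{\varepsilon_n\,(1 - P_{t,\cG})}{2} \asymp n^{-(p+q)},
\qquad
\frac{n\cdot n^{-(p+q)}}{\sqrt{n/12}} \asymp n^{\frac12 - (p+q)},
\]so the standardized shift of $\sum_t \zeta_t$ diverges exactly when $p + q < 1/2$ and vanishes when $p + q > 1/2$.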
another application of maximal coupling is speculative decoding, where a smaller draft model generates speculative token predictions that are then accepted or corrected by a more powerful target model. The paper applies this setup to the watermarking scheme and demonstrates that the method effectively retains the watermark signal despite targeted attempts to modify the text.
2. Preliminaries
maximal coupling is a powerful tool in probability theory that constructs a coupling between two random variables to maximize the probability that they are equal
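concretely, for two distributions $P$ and $Q$ on $\cW$, any coupling $(X, Y)$ with $X\sim P$ and $Y\sim Q$ satisfies $\Pr(X = Y)\le 1 - \mathrm{TV}(P, Q)$, and a maximal coupling attains this bound:
\[
\Pr(X = Y) = \sum_{w\in\cW}\min(P_w, Q_w) = 1 - \mathrm{TV}(P, Q),
\]which is exactly the overlap probability $p = \sum_{w}\min(P_{t,w}, Q_{t,w})$ used in the decoding step.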
3. Watermarking Scheme via Maximal Coupling
a watermarking scheme has two parts: decoding and detection
3.1 Decoding
3.2 Detection
- Detection as Hypothesis Testing
- Sum Test
- Higher Criticism
- Practical Considerations: the probability of sampling a token from the green list can be arbitrarily close to 1, or even equal to 1, in which case $U[0, P_{t,\cG}]$ is (nearly) indistinguishable from $U[0, 1]$ and the token carries (almost) no watermark signal
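a minimal sketch of the two detection statistics, assuming the per-token evidence has already been reduced to the $\zeta_t$'s (for the sum test) and to p-values that are $U[0,1]$ under $H_0$ (for higher criticism); how exactly the paper pivots $\zeta_t$ and the observed token into a p-value, and the one-sided direction it uses, are left out here, and the HC form below is the standard Donoho-Jin statistic rather than necessarily the paper's exact variant:

```python
import numpy as np


def sum_test_stat(zetas: np.ndarray) -> float:
    """Standardized sum of the zeta_t's; under H_0 they are iid U[0,1]."""
    n = len(zetas)
    return float((zetas.sum() - n / 2) / np.sqrt(n / 12))


def higher_criticism_stat(pvals: np.ndarray, alpha0: float = 0.5) -> float:
    """Donoho-Jin higher criticism over the smallest alpha0-fraction of sorted p-values."""
    n = len(pvals)
    p_sorted = np.clip(np.sort(pvals), 1e-10, 1 - 1e-10)    # avoid division by zero at the boundary
    i = np.arange(1, n + 1)
    hc = np.sqrt(n) * (i / n - p_sorted) / np.sqrt(p_sorted * (1 - p_sorted))
    return float(hc[: max(1, int(alpha0 * n))].max())
```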
3.3 Watermark Key
in cases where the watermarked text has a predetermined maximum length $n$, the language model provider can pre-generate both the green lists $\cG_{1:n}$ and uniform random variables $\zeta_{1:n}$, sharing them with the detector as the watermark key
for varying lengths, the language model provider supplies two distinct pseudorandom functions as the watermark key
- the first generates the green lists
- the second generates the uniform random variables
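a sketch of one way to realize these two pseudorandom functions with a keyed hash of the preceding $k$ tokens; this mirrors common practice for pseudorandom watermark keys and is an assumption, not the paper's stated construction:

```python
import hashlib

import numpy as np


def _seed(key: bytes, prev_tokens: list, k: int, tag: bytes) -> int:
    digest = hashlib.sha256(key + tag + str(prev_tokens[-k:]).encode()).digest()
    return int.from_bytes(digest[:8], "little")


def green_list(key: bytes, prev_tokens: list, vocab_size: int, gamma: float = 0.5, k: int = 4) -> np.ndarray:
    """First pseudorandom function: a green-list mask covering a fraction gamma of the vocabulary."""
    rng = np.random.default_rng(_seed(key, prev_tokens, k, b"green"))
    mask = np.zeros(vocab_size, dtype=bool)
    mask[rng.permutation(vocab_size)[: int(gamma * vocab_size)]] = True
    return mask


def zeta(key: bytes, prev_tokens: list, k: int = 4) -> float:
    """Second pseudorandom function: the uniform random variable zeta_t."""
    rng = np.random.default_rng(_seed(key, prev_tokens, k, b"zeta"))
    return float(rng.uniform())
```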
3.4 Interpret Speculative Decoding as Post-processing
4. Simulation Studies
two distinct regimes
- the first regime covers situations where each green token carrying the watermark signal exhibits sufficient signal strength, requiring only $P_{t, \cG_t} \ge m^{-r}$
- the second regime deals with a weak signal condition, that is, $P_{t, \cG_t} = 1-m^{-q}$
compute each test statistic under both the null and alternative hypotheses 2000 times and use the quantile of the simulated histogram of the test statistic under the null hypothesis to determine the critical value
The first regime: strong signal
- fix $r = 0.2$
- test different values of $p\in {0.25, 0.5, 0.75}$
generate random variables $\zeta_t$ under the alternative hypothesis,
- first sample $P_{t, \cG_t}$ uniformly from $[m^{-r}, 1]$
- then sample $\zeta_t$ from $U[0, P_{t,\cG_t}]$
- when $p$ is small, a large fraction of the $\zeta_t$’s carry the watermark signal, and the power of both tests converges to 1 as $m$ increases
- as $p$ becomes larger, the power of both tests decreases
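a minimal sketch of this Monte Carlo protocol for the strong-signal regime and the sum test; I use a two-sided rejection region calibrated on the simulated null, which keeps the sketch agnostic about the sign convention of the one-sided test in the paper (parameter names follow the notation above):

```python
import numpy as np


def sum_test_power(m: int, p: float, r: float, reps: int = 2000, alpha: float = 0.05, seed: int = 0) -> float:
    """Monte Carlo power of a (two-sided) sum test in the strong-signal regime."""
    rng = np.random.default_rng(seed)
    eps = m ** (-p)                                          # fraction of tokens carrying the signal

    def stat(z):                                             # |standardized sum| for each replication
        return np.abs(z.sum(axis=1) - m / 2) / np.sqrt(m / 12)

    null = rng.uniform(size=(reps, m))                       # H_0: zeta_t iid U[0,1]
    # alternative: with prob. eps, zeta_t ~ U[0, P] where P ~ U[m^{-r}, 1]
    carries = rng.uniform(size=(reps, m)) < eps
    P = rng.uniform(m ** (-r), 1.0, size=(reps, m))
    alt = np.where(carries, rng.uniform(size=(reps, m)) * P, rng.uniform(size=(reps, m)))

    crit = np.quantile(stat(null), 1 - alpha)                # critical value from the simulated null histogram
    return float((stat(alt) > crit).mean())


# e.g. sum_test_power(m=10_000, p=0.25, r=0.2); for small p the power grows toward 1 as m increases
```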
The second regime: weak signal
set $P_{t, \cG_t} = 1-m^{-q}$ to maintain a consistently weak signal strength
- Scenario 1: Fix $p$ and vary $q$. Set the parameter $p$ at 0.2 and vary the parameter $q\in {0.2, 0.3, 0.4, 0.5, 0.6}$
- Scenario 2: Fix $q$ and vary $p$. Set $q$ at 0.4, and adjust $p\in {0.1, 0.2, 0.3}$
- Scenario 3: On the boundary $p+q=1/2$
- Scenario 4: On the boundary $2p+q = 1$
there is a consistent pattern across all scenarios: the sum test tends to outperform $HC$ when $m$ is relatively small
5. Language Model Experiments
prompt Microsoft’s instruction fine-tuned Phi-3 models on two question-answering datasets, namely ELI5 and FinQA
for all experiments, embed watermarks into the Phi-3-mini-4k-instruct model (3.8B parameters)
- to evaluate the text distortion induced by the watermark, use the S-BERT similarity score to compare the semantic similarity between the generated text of the same model with and without watermarking (see the sketch after this list)
- the higher the similarity score, the less the distortion
- report the true positive rate (TPR) before and after attacks, including a purely substitution-based attack (TPR aug.) and a paraphrasing attack (TPR para.)
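a minimal sketch of computing the S-BERT similarity with the sentence-transformers package; the checkpoint name below is just a common default and may not be the one used in the paper:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")    # any S-BERT checkpoint; the paper's choice may differ


def sbert_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between S-BERT embeddings of two generations (higher = less distortion)."""
    emb = model.encode([text_a, text_b], convert_to_tensor=True)
    return float(util.cos_sim(emb[0], emb[1]))
```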
in addition, investigate the speculative decoding setup to assess the robustness of different watermarks, where the goal is to modify watermarked texts so as to improve their quality