WeiYa's Work Yard

A traveler with endless curiosity, who fell into the ocean of statistics, tries to write down his ideas and notes to save himself.

Robust Detection of Watermarks for LLM under Human Edits

Posted on
Tags: Large Language Model, Mixture Model, Watermarking, Maximal Coupling

This note is for Li, X., Ruan, F., Wang, H., Long, Q., & Su, W. J. (2025). Robust detection of watermarks for large language models under human edits. Journal of the Royal Statistical Society Series B.

the pervasive presence of human edits on LLM-generated text dilutes watermark signals, thereby significantly degrading the detection performance of existing methods

the paper models human edits through mixture model detection, and introduces a new method, a truncated goodness-of-fit test (Tr-GoF), for detecting watermarked text under human edits

  • Tr-GoF achieves optimality in robust detection of the Gumbel-max watermark in a certain asymptotic regime of substantial text modifications and vanishing watermark signals
  • Tr-GoF achieves this optimality adaptively, without requiring precise knowledge of human edit levels or probabilistic specifications of LLMs, unlike the optimal but impractical Neyman-Pearson likelihood ratio test
  • Tr-GoF attains the highest detection efficiency rate under moderate text modifications

Introduction

model the watermark signal across all tokens as a mixture: with probability $\varepsilon_n$, a token carries signal; otherwise, it contributes noise.
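A toy sketch of this mixture for intuition: each token's $p$-value is drawn from a signal distribution with probability $\varepsilon_n$ and from $U[0, 1]$ (pure noise) otherwise. The function name and the Beta(0.3, 1) stand-in signal below are illustrative assumptions, not the paper's Gumbel-max construction.

```python
import numpy as np

def simulate_mixture_pvalues(n, eps, rng=None):
    """Toy mixture of p-values: signal with probability eps, U[0,1] noise otherwise.

    Beta(0.3, 1) is only a stand-in signal distribution (skewed toward small
    p-values); it is NOT the paper's Gumbel-max derivation.
    """
    rng = np.random.default_rng(rng)
    is_signal = rng.random(n) < eps
    pvals = rng.random(n)                                         # null noise: uniform p-values
    pvals[is_signal] = rng.beta(0.3, 1.0, size=is_signal.sum())   # stand-in watermark signal
    return pvals

# e.g. n = 1000 tokens with a vanishing watermark fraction eps_n ~ n^{-1/3}
pvals = simulate_mixture_pvalues(1000, 1000 ** (-1 / 3), rng=0)
```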

phase transition for watermark detection

assume $\varepsilon_n\asymp n^{-p}$ and $1 - \max_{w\in\cW} P_{t,w}\asymp n^{-q}$ for constants $0 < p, q < 1$

the finding is that the optimal detection region in the $(p, q)$ plane has boundary $q + 2p = 1$

unlike sum-based approaches, this goodness-of-fit test adaptively identifies areas of significant departure between null and alternative distributions without requiring prior knowledge of the watermark fraction or of next-token prediction (NTP) regularity

  • most robust watermarking methods focus on algorithmic or cryptographic approaches
    • algorithmic methods emphasize resilient watermarking designs
    • cryptographic approaches protect watermark signals through robust pseudorandom codes
  • besides the mixture model, another statistical approach to robustness is the edit tolerance limit, which quantifies the maximum proportion of edits a detection method can withstand while remaining effective; studies often apply this perspective to the green-red list watermark
  • other forms of robustness include decoder-focussed and cryptographic approaches
  • robust statistics, which traditionally addresses resilience to outliers and perturbations
    • here, the mixture formulation connects with Huber’s contamination model. However, unlike in Huber’s model where the contaminated distribution is typically unknown, in watermark detection, this distribution is known.
  • this work is also related to sparse mixture detection

The Tr-GoF detection method

w.l.o.g., assume the data used in Tr-GoF are all $p$-values, which is always feasible by transforming observations using the CDF of $\mu_0$

Let $\bbF_n(r)$ denote the empirical distribution of the $p$-values,

\[\bbF_n(r) = \frac 1n\sum_{t=1}^n 1_{p_t\le r}, \qquad \text{for } r\in [0, 1]\]

and its expectation under $H_0$ is the uniform $U[0, 1]$ distribution, with CDF $\bbF_0(r) = r$ for $r\in [0, 1]$

Tr-GoF utilizes the following test statistic to measure the deviation between the empirical CDF $\bbF_n$ and the expected CDF $\bbF_0$:

\[S_n^+(s) = \sup_{r\in [p^+, 1]} K_s^+(\bbF_n(r), \bbF_0(r))\]

there are two truncations:

  • the first is on the $r$-domain: the supremum starts at $p^+ = \sup\{p_{(t)}: p_{(t)} \le c_n^+\}$, where $c_n^+\in [0, 1]$ is introduced for stability and $p_{(t)}$ is the $t$-th smallest $p$-value among $Y_{1:n}$
  • the second truncates $K_s$ to $K_s^+$, defined by
\[K_s^+(u, v) = \begin{cases} K_s(u, v) & \text{ if } 0 < v < u < 1 \\ 0 & \text{ otherwise} \end{cases}\]

where $K_s(u, v)$ represents the $\phi_s$-divergence between $\text{Ber}(u)$ and $\text{Ber}(v)$:

\[K_s(u, v) = D_{\phi_s}(\text{Ber}(u)\Vert \text{Ber}(v)) = v\phi_s(u/v) + (1 - v) \phi_s\left(\frac{1-u}{1-v}\right)\]
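The note does not spell out $\phi_s$; a standard choice consistent with this construction, and the one assumed in the sketches below, is the Jager-Wellner family

\[\phi_s(x) = \begin{cases} \dfrac{x^s - sx + s - 1}{s(s-1)} & \text{ if } s \neq 0, 1 \\ x\log x - x + 1 & \text{ if } s = 1 \\ -\log x + x - 1 & \text{ if } s = 0 \end{cases}\]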

define the detection method as follows: for a small $\delta > 0$, reject $H_0$ in favour of $H_1^{(mix)}$ if

\[n\cdot S_n^+(s) \ge (1+\delta)\log\log n\]
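A minimal numerical sketch of the statistic and the decision rule, assuming the $\phi_s$ family given above; the $1/n$ default for $c_n^+$ and the function names are hypothetical choices for illustration. Since $\bbF_n$ is a step function and $K_s(u, v)$ is decreasing in $v$ on the region $v < u$, the supremum is attained at the order statistics $r = p_{(t)}$, which is what the code evaluates.

```python
import numpy as np

def phi_s(x, s):
    """Jager-Wellner-type phi_s generator (assumed family; see above)."""
    x = np.asarray(x, dtype=float)
    if s == 1:
        return x * np.log(x) - x + 1
    if s == 0:
        return -np.log(x) + x - 1
    return (x ** s - s * x + s - 1) / (s * (s - 1))

def K_s(u, v, s):
    """phi_s-divergence between Ber(u) and Ber(v)."""
    return v * phi_s(u / v, s) + (1 - v) * phi_s((1 - u) / (1 - v), s)

def tr_gof(pvals, s=2.0, c_n_plus=None):
    """Return n * S_n^+(s) computed from the p-values.

    c_n_plus is the r-domain truncation level; the 1/n default here is a
    hypothetical choice for illustration, not the paper's recommendation.
    """
    p = np.sort(np.asarray(pvals, dtype=float))
    n = len(p)
    if c_n_plus is None:
        c_n_plus = 1.0 / n
    # p^+ = sup{p_(t) : p_(t) <= c_n^+}; restrict the supremum to r >= p^+
    p_plus = np.max(p[p <= c_n_plus], initial=0.0)
    keep = p >= p_plus
    F_n = (np.arange(1, n + 1) / n)[keep]    # empirical CDF at r = p_(t)
    F_0 = p[keep]                            # null CDF F_0(r) = r
    # K_s^+ keeps only the region 0 < F_0 < F_n < 1; it is 0 elsewhere
    valid = (F_0 > 0) & (F_0 < F_n) & (F_n < 1)
    return n * np.max(K_s(F_n[valid], F_0[valid], s), initial=0.0)

def tr_gof_reject(pvals, s=2.0, delta=0.1):
    """Reject H_0 iff n * S_n^+(s) >= (1 + delta) * log log n."""
    n = len(pvals)
    return tr_gof(pvals, s) >= (1 + delta) * np.log(np.log(n))

if __name__ == "__main__":
    # pure-null p-values should rarely trigger a rejection
    null_p = np.random.default_rng(0).random(1000)
    print(tr_gof(null_p), tr_gof_reject(null_p))
```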

Remarks:

  1. selection of the critical value: in practice, the critical value is determined via Monte Carlo simulation
  2. the Tr-GoF test departs from the GoF test by using two truncations
  3. drawbacks of an untruncated $r$-domain. The untruncated $S_n(s)$ has two main drawbacks
    • its weak convergence is slow
    • without removing the extreme contributions of the smallest $p$-values, $S_n(s)$ is prone to heavy-tail behaviour
  4. relationship to higher criticism: Higher Criticism (HC) is a special instance of Tr-GoF (the case $s = 2$; see the sketch below)
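To see the HC connection concretely under the assumed $\phi_s$ family: taking $s = 2$ gives $\phi_2(x) = (x - 1)^2/2$, so

\[K_2(u, v) = \frac{(u - v)^2}{2v(1 - v)},\]

and $\sqrt{2n\, K_2(\bbF_n(r), r)} = \sqrt{n}\,\lvert \bbF_n(r) - r\rvert / \sqrt{r(1 - r)}$ is exactly the higher-criticism normalization; the truncations then restrict attention to the one-sided region $\bbF_n(r) > r$ and to $r \ge p^+$.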

4. Theoretical Guarantees

4.1 Detectability

  • Heavy edits case
  • Light edits case

4.2 Adaptive optimality

  • Optimal adaptivity
  • Suboptimality of sum-based detection rules

4.3 Optimal efficiency rate

define efficiency as the rate of exponential decrease in Type II errors for a fixed significance level $\alpha$, considering the least-favourable NTP distribution within a belief class $\cP$
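Schematically (my paraphrase of this notion, not the paper's exact statement): for a test held at significance level $\alpha$ with Type II error $\beta_n(P)$ when the NTP distribution is $P$, the efficiency rate is the exponent $R$ in

\[\sup_{P\in\cP} \beta_n(P) = e^{-nR(1 + o(1))},\]

so a larger $R$ means the worst-case (least-favourable) Type II error over the belief class vanishes faster.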

  • Optimal $\cP_\Delta$-efficiency
