Robust Detection of Watermarks for LLMs under Human Edits
the pervasive presence of human edits in LLM-generated text dilutes watermark signals, thereby significantly degrading the detection performance of existing methods
the paper models human edits through mixture model detection and introduces a new method, a truncated goodness-of-fit test (Tr-GoF), for detecting watermarked text under human edits
- Tr-GoF achieves optimality in robust detection of the Gumbel-max watermark in a certain asymptotic regime of substantial text modifications and vanishing watermark signals
- Tr-GoF achieves this optimality adaptively, without requiring precise knowledge of human edit levels or probabilistic specifications of LLMs, unlike the optimal but impractical Neyman-Pearson likelihood ratio test
- Tr-GoF attains the highest detection efficiency rate under moderate text modifications
Introduction
model the watermark signal across all tokens as a mixture: with probability $\varepsilon_n$, a token carries signal; otherwise, it contributes noise.
phase transition for watermark detection
assume $\varepsilon_n\asymp n^{-p}$ and $1 - \max_{w\in\cW} P_{t,w}\asymp n^{-q}$ for constants $0 < p, q < 1$
the finding is that the optimal detection region in the $(p, q)$ plane has boundary $q + 2p = 1$
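to make the mixture and its asymptotic calibration concrete, here is a toy simulation sketch; the Beta(a, 1) component is purely a stand-in for the watermarked p-value distribution (the paper works with actual Gumbel-max pivotal statistics, and this toy ignores the NTP-dependent signal strength governed by $q$), and all constants are arbitrary:

```python
import numpy as np

def simulate_mixture_pvalues(n, p_exp=0.4, a=0.2, rng=None):
    """Toy sparse mixture of p-values: (1 - eps_n) * U[0,1] + eps_n * Beta(a, 1),
    with eps_n = n**(-p_exp) mimicking the calibration eps_n ~ n^{-p}.
    Beta(a, 1) with a < 1 concentrates near 0 and is only a stand-in for the
    watermarked p-value distribution, which these notes leave abstract."""
    rng = np.random.default_rng() if rng is None else rng
    eps_n = n ** (-p_exp)
    is_signal = rng.random(n) < eps_n             # tokens that carry watermark signal
    pvals = rng.random(n)                         # null/noise tokens: uniform p-values
    pvals[is_signal] = rng.beta(a, 1.0, size=int(is_signal.sum()))
    return pvals

pvals = simulate_mixture_pvalues(n=10_000, p_exp=0.4)
```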
unlike sum-based approaches, this goodness-of-fit test adaptively identifies areas of significant departure between null and alternative distributions without requiring prior knowledge of the watermark fraction or NTP regularity
related work
- most robust watermarking methods focus on algorithmic or cryptographic approaches
- algorithmic methods emphasize resilient watermarking designs
- cryptographic approaches protect watermark signals through robust pseudorandom codes
- besides the mixture model, another statistical approach to robustness is the edit tolerance limit, which quantifies the maximum proportion of edits a detection method can withstand while remaining effective; studies often apply this perspective to the green-red list watermark
- other forms of robustness include decoder-focussed and cryptographic approaches
- robust statistics, which traditionally addresses resilience to outliers and perturbations
- here, the mixture formulation connects with Huber’s contamination model. However, unlike in Huber’s model where the contaminated distribution is typically unknown, in watermark detection, this distribution is known.
- this work is also related to sparse mixture detection
The Tr-GoF detection method
w.l.o.g., assume the data used in Tr-GoF are all $p$-values, which is always feasible by transforming observations using the CDF of $\mu_0$
Let $\bbF_n(r)$ denote the empirical distribution of the $p$-values,
\[\bbF_n(r) = \frac 1n\sum_{t=1}^n 1_{p_t\le r}, \qquad \text{for } r\in [0, 1]\]and its expectation under $H_0$ is the uniform distribution $U[0, 1]$, with CDF $\bbF_0(r) = r$ for $r\in [0, 1]$
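as a small illustration of these two ingredients, here is a sketch of the p-value transform and the empirical CDF (the choice of $\mu_0$ and the orientation of the p-values below are my own placeholder assumptions, not something the notes fix):

```python
import numpy as np

def to_pvalues(Y, F0):
    """Probability integral transform: F0 is the CDF of mu_0, so F0(Y) ~ U[0,1] under H_0.
    Using 1 - F0(Y) sends large pivotal statistics to small p-values (an assumed convention)."""
    return 1.0 - F0(np.asarray(Y, dtype=float))

def empirical_cdf(pvals, r):
    # F_n(r) = (1/n) * #{t : p_t <= r}
    p = np.asarray(pvals)[:, None]
    return np.mean(p <= np.atleast_1d(r)[None, :], axis=0)

# toy check with mu_0 = U[0,1] (so F0 is the identity on [0,1]): under H_0 the
# empirical CDF should track the uniform CDF F_0(r) = r
Y = np.random.rand(500)
pvals = to_pvalues(Y, F0=lambda y: y)
print(empirical_cdf(pvals, [0.1, 0.5, 0.9]))      # roughly [0.1, 0.5, 0.9]
```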
Tr-GoF utilizes the following test statistic to measure the deviation between the empirical CDF $\bbF_n$ and the expected CDF $\bbF_0$:
\[S_n^+(s) = \sup_{r\in [p^+, 1]} K_s^+(\bbF_n(r), \bbF_0(r))\]there are two truncations:
- on the $r$-domain, where $p^+ = \sup\{p_{(t)}: p_{(t)} \le c_n^+\}$ with $c_n^+\in [0, 1]$ introduced for stability, and $p_{(t)}$ is the $t$-th smallest of the $p$-values computed from $Y_{1:n}$
- the second is to truncate $K_s$ itself, replacing it with a truncated version $K_s^+$, where $K_s(u, v)$ denotes the $\phi_s$-divergence between $\text{Ber}(u)$ and $\text{Ber}(v)$:
\[K_s(u, v) = D_{\phi_s}(\text{Ber}(u)\Vert \text{Ber}(v)) = v\phi_s(u/v) + (1 - v) \phi_s\left(\frac{1-u}{1-v}\right)\]define the detection method as follows: for a small $\delta > 0$, reject $H_0$ in favour of $H_1^{(mix)}$ if
\[n\cdot S_n^+(s) \ge (1+\delta)\log\log n\]a code sketch of this statistic and decision rule is given after the remarks below.
Remarks:
- selection of the critical value: in practice, they use Monte Carlo simulations
- the Tr-GoF test departs from the GoF test by using two truncations
- drawbacks of an untruncated $r$-domain. The untruncated $S_n(s)$ has two main drawbacks
- its weak convergence is slow
- without removing extremely small $p$-values, $S_n(s)$ is prone to a heavy-tail issue
- relationship to higher criticism. The Higher Criticism (HC) is a special instance of Tr-GoF
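putting the pieces together, here is a minimal sketch of the Tr-GoF statistic and decision rule (as referenced above). Several choices are my own assumptions rather than the paper's specification: $\phi_s$ is taken from the standard power-divergence family (the notes do not spell it out), the supremum over $r$ is evaluated only at the order statistics, the default $c_n^+$ is an arbitrary placeholder, and since the notes do not give the exact form of $K_s^+$, the truncation of $K_s$ is approximated by a crude one-sided indicator:

```python
import numpy as np

def phi_s(x, s):
    # Power-divergence generator (assumed form; the notes do not spell out phi_s).
    x = np.asarray(x, dtype=float)
    if s == 1:
        return x * np.log(x) - x + 1
    if s == 0:
        return -np.log(x) + x - 1
    return (1 - s + s * x - x ** s) / (s * (1 - s))

def K_s(u, v, s):
    # phi_s-divergence between Ber(u) and Ber(v), as defined in the notes.
    return v * phi_s(u / v, s) + (1 - v) * phi_s((1 - u) / (1 - v), s)

def tr_gof_statistic(pvals, s=2.0, c_n_plus=None):
    """Sketch of S_n^+(s) = sup_{r in [p^+, 1]} K_s^+(F_n(r), F_0(r))."""
    p = np.sort(np.asarray(pvals, dtype=float))
    n = len(p)
    if c_n_plus is None:
        c_n_plus = 1.0 / n                          # placeholder; the paper's choice may differ
    below = p[p <= c_n_plus]
    p_plus = below[-1] if below.size else p[0]      # p^+ = sup{p_(t) : p_(t) <= c_n^+}
    r = p[p >= p_plus]                              # evaluate the sup at the order statistics only
    Fn = np.searchsorted(p, r, side="right") / n    # empirical CDF F_n at those points
    keep = (r > 0) & (r < 1) & (Fn > 0) & (Fn < 1)  # keep K_s finite
    r, Fn = r[keep], Fn[keep]
    vals = K_s(Fn, r, s)
    vals = vals[Fn >= r]        # crude one-sided stand-in for the K_s -> K_s^+ truncation
    return float(vals.max()) if vals.size else 0.0

def tr_gof_reject(pvals, s=2.0, delta=0.1, c_n_plus=None):
    # Reject H_0 in favour of H_1^(mix) when n * S_n^+(s) >= (1 + delta) * log log n.
    n = len(pvals)
    return n * tr_gof_statistic(pvals, s, c_n_plus) >= (1 + delta) * np.log(np.log(n))
```

paired with the toy mixture simulator from the introduction, `tr_gof_reject(simulate_mixture_pvalues(10_000, p_exp=0.3))` gives a quick end-to-end smoke test; as the first remark above notes, in practice the critical value would instead be calibrated by Monte Carlo simulation under $H_0$.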
4. Theoretical Guarantees
4.1 Detectability
- Heavy edits case
- Light edits case
4.2 Adaptive optimality
- Optimal adaptivity
- Suboptimality of sum-based detection rules
4.3 Optimal efficiency rate
define efficiency as the rate of exponential decrease in Type II errors at a fixed significance level $\alpha$, considering the least-favourable NTP distribution within a belief class $\cP$
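a rough way to write this down (my own formalization of the sentence above, not necessarily the paper's exact display): for a test run at a fixed level $\alpha$, the efficiency is
\[\liminf_{n\to\infty}\; \inf_{P\in\cP}\; \left\{-\frac 1n \log \mathbb{P}_{H_1, P}\big(\text{Type II error}\big)\right\},\]
i.e., the worst-case exponential decay rate of the Type II error over NTP distributions in the belief class $\cP$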
- Optimal $\cP_\Delta$-efficiency