Detecting Watermarks

Detecting watermarks in text doesn't require access to the language model itself; it only requires knowledge of the hash function and random number generator used to produce the red list at each position. The method is straightforward: we count violations of the red list rule and use that count to test the null hypothesis H0 that the text sequence was generated with no knowledge of the red list rule.
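As a minimal sketch of this idea, the detector below recomputes each position's green list from a hash of the previous token and counts how many tokens land in it. The function names, the SHA-256 seeding, and the 50/50 split (gamma = 0.5) are illustrative assumptions; a real detector must reproduce the generator's exact hash function and PRNG.

```python
import hashlib
import random

def green_list_for(prev_token: int, vocab_size: int, gamma: float = 0.5) -> set:
    # Seed a PRNG with a hash of the previous token, then take the first
    # gamma fraction of a shuffled vocabulary as this position's green list.
    # NOTE: the hash, PRNG, and gamma here are assumptions for illustration;
    # they must match whatever scheme the generator actually used.
    seed = int(hashlib.sha256(str(prev_token).encode()).hexdigest(), 16) % (2**32)
    rng = random.Random(seed)
    vocab = list(range(vocab_size))
    rng.shuffle(vocab)
    return set(vocab[: int(gamma * vocab_size)])

def count_green_tokens(token_ids: list, vocab_size: int) -> int:
    # A token follows the rule when it lands in the green list derived from
    # its predecessor; the first token has no predecessor and is skipped.
    return sum(
        tok in green_list_for(prev, vocab_size)
        for prev, tok in zip(token_ids, token_ids[1:])
    )
```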

Since the red list is chosen at random, a natural writer should violate the red list rule with roughly half of their tokens. In contrast, a watermarked model produces sequences with no violations at all. The odds that a natural source produces a series of T tokens without a single red list violation are only 1/2^T, a probability that shrinks rapidly even for short text. This statistical improbability allows us to detect watermarks effectively, even in something as brief as a synthetic tweet.
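To see how quickly this probability collapses, here is a quick back-of-the-envelope check, assuming (as above) that each token independently falls in the red list with probability 1/2:

```python
# Chance that natural text of length T avoids the red list at every position,
# assuming each token independently falls in the red list with probability 1/2.
for T in (10, 25, 140):
    print(f"T = {T:>3}: P(no violations) = {0.5 ** T:.2e}")
```

At T = 25 (roughly tweet length), the probability is already about 3 × 10^-8, around one chance in thirty million.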

For a more robust analysis, we can apply a one-proportion z-test to evaluate this null hypothesis. Under H0, the number of green list tokens in a sequence, denoted |s|_G, has an expected value of T/2 and a variance of T/4.

We reject H0 and confirm the presence of a watermark when the z-statistic

z = \frac{2(|s|_G - T/2)}{\sqrt{T}}

exceeds a certain threshold, say z > 4. This threshold corresponds to a one-sided p-value of about 3 × 10^-5, meaning the risk of a false positive (declaring a watermark where there is none) is exceptionally low.
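A sketch of the test itself, assuming the green token count has already been obtained (for instance with the count_green_tokens sketch above); the threshold parameter mirrors the z > 4 cutoff from the text:

```python
import math

def watermark_z_score(num_green: int, T: int) -> float:
    # One-proportion z-statistic for H0: each token is green with probability
    # 1/2, so |s|_G has mean T/2 and variance T/4.
    return 2.0 * (num_green - T / 2.0) / math.sqrt(T)

def is_watermarked(num_green: int, T: int, threshold: float = 4.0) -> bool:
    # Reject H0 (declare the text watermarked) when z exceeds the threshold.
    return watermark_z_score(num_green, T) > threshold
```

For example, a 200-token sequence with 150 green tokens gives z = 2(150 - 100)/sqrt(200) ≈ 7.07, comfortably above the threshold, while 105 green tokens gives z ≈ 0.71 and is treated as unwatermarked.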