How Many Rewrites to Strip a Watermark? Empirical Paraphrase-Removal Curves for LLM Watermarks

Cross-model paraphrasing drops Kirchenbauer watermark detection from 100% to 60% in a single pass. After ten passes, it plateaus at 40%. The watermark is partially robust — but not enough for adversarial settings where the attacker has access to any LLM.

I measured this by watermarking text with GPT-2, paraphrasing with Claude Haiku, and tracking how the z-score decays. Five experiments. Six pre-registered hypotheses. Real green-list watermarking with logit access.

Detection rate vs paraphrase passes

What We Built

We implemented the Kirchenbauer et al. (2023) green-list watermark using GPT-2 (124M parameters) with direct logit access. The algorithm works at generation time: before each token is sampled, the model partitions the vocabulary into “green” and “red” lists based on a hash of the previous token, then adds a bias (δ=2.0) to green-list logits. Detection checks whether the text uses more green tokens than expected by chance, computing a z-score.

For the adversarial attack, we use Claude Haiku as a cross-model paraphraser — a completely different architecture with no knowledge of the watermarking key. Each paraphrase pass asks the model to completely rewrite the text while preserving meaning.

This cross-model setup tests the realistic attack scenario: an adversary doesn’t know which watermark scheme was used or what the secret key is, but can access a cheap LLM to rewrite the text.

What We Found

At pass=0 (no paraphrasing), watermarked text is strongly detectable. Mean z-score of 9.64±1.03 across 5 seeds, with 100% detection rate. The green-list fraction is ~84-86%, far above the 50% expected by chance. The Kirchenbauer watermark works.

Cross-model paraphrasing degrades the signal — but not as fast as you’d think. One pass drops detection from 100% to 60%. Then it plateaus. Even after 10 passes, 40% of watermarked texts are still detectable.

Passes	Mean z-score	Detection Rate
0	9.64 ± 1.03	100%
1	5.21 ± 1.74	60%
2	4.85 ± 1.75	60%
3	4.72 ± 0.96	60%
5	4.66 ± 1.87	60%
10	3.89 ± 1.26	40%

Z-score decay under paraphrasing

The Recovery Story: Why v1 Failed and v2 Worked

This project has a backstory worth telling. Our first attempt (v1) used a completely different approach: output-level synonym substitution. Since we were initially using the Claude API (no logit access), we approximated the watermark by replacing common words with “green” synonyms from a 15-pair vocabulary.

v1 result: 0% detection across 45 experimental conditions.

Not because watermarks are robust — because the simulation was structurally too weak. With only 15 synonym pairs, a typical 200-word text contains just 1-11 signal words. You need at least 16 for z > 2.0 detection. The experiment was unable to produce results regardless of any parameter.

The key insight: logit-level watermarking biases every token (~200 data points per text). Output-level substitution only touches vocabulary-matched words (~5 per text). That’s a ~30x signal density gap.

Metric	v1 (Synonym)	v2 (Kirchenbauer)
Signal words per 200-token text	~5	~149
E0 z-score	0.94 (noise)	8.44 (strong)
Detection at pass=0	0%	100%

Watermark Strength and Text Length

We also measured how watermark strength (the δ parameter) and text length affect robustness under paraphrasing.

Watermark strength (E4): Higher δ means stronger bias toward green tokens. E0 confirms the dose-response: δ=1.0 produces z=4.32, while δ=4.0 produces z=8.83. Stronger watermarks should survive more paraphrase passes — E4 tests this with paraphrasing.

Watermark strength vs robustness

Text length (E3): Longer texts have more tokens scored, providing more statistical power for detection. Shorter texts should be more vulnerable to watermark removal. E3 tests 50, 150, and 300 token lengths under 3 paraphrase passes.

Text length vs robustness

False Positive Rate

A watermark detector is useless if it falsely flags unwatermarked text. E5 tests the false positive rate on two categories: unwatermarked GPT-2 output and human-written text. E0b already shows z=-0.08 for unwatermarked text (well below the z=4.0 threshold).

False positive rate

What the Governance Framework Caught

This project was a deliberate governance pressure test — applying our research governance (50+ rules, pre-registered hypotheses, quality gates) to an unfamiliar domain.

What worked:

Pre-registration forced honest reporting of the v1 negative result. We could not retroactively claim we were “testing simulation fidelity.” The hypotheses were about watermark robustness, and they failed.
E0 sanity validation (v2) uses realistic LLM output per LL-94, not hand-crafted sentences with artificially high signal density.
R52 (Autonomous Quality Loop) caught the 7 fixable gaps in the project and guided the iteration from v1 (5/10) toward v2 (target ≥8/10).

What v1 missed (now fixed):

No power analysis at Gate 0.5. A back-of-envelope calculation would have revealed that 15 synonym pairs produce insufficient signal. Now mandatory per LL-93.
E0 sanity on unrealistic data. v1’s sanity check sentence contained 10+ signal words — far more than real LLM output. Now all E0 tests must use realistic/production-like inputs per LL-94.

Reproducibility

All code, data, and experiment outputs are in the repository. Watermarking uses GPT-2 (124M) via HuggingFace Transformers. Paraphrasing uses Claude 3 Haiku (claude-3-haiku-20240307). 5 fixed seeds (42, 123, 456, 789, 1024). Run reproduce.sh to replicate.

Kirchenbauer et al. (2023) — “A Watermark for Large Language Models”: The green-list watermarking method we implement and test.
Sadasivan et al. (2023) — “Can AI-Generated Text be Reliably Detected?”: Theoretical argument that paraphrasing defeats detection. We provide empirical pass-count curves.
Krishna et al. (2023) — “Paraphrasing Evades AI-Generated Text Detectors”: Showed DIPPER paraphraser defeats detectors. We extend to LLM-as-paraphraser (cross-model attack).
Christ et al. (2024) — “Undetectable Watermarks for Language Models”: Theoretical robustness bounds. We test the practical Kirchenbauer scheme.
Mitchell et al. (2023) — “DetectGPT”: Zero-shot detection without watermarks, an alternative approach.
Zhao et al. (2023) — “Provable Robust Watermarking”: Theoretical robustness analysis we empirically test.

Limitations

This study uses GPT-2 (124M parameters) for watermark generation — a model small enough for CPU-based experimentation but not representative of production LLM output quality. Larger models may produce text with different statistical properties that affect watermark robustness. The paraphrase attack uses a single T5-based paraphraser; real-world attackers would use multiple rewriting strategies. Detection thresholds (z=4.0) are specific to the green-list watermarking scheme tested and may not generalize to other watermark families.

What’s Next

The paraphrase-removal curves establish a baseline for watermark decay — but the real question is whether any watermark scheme survives adaptive attackers who know the detection method. Next steps: test logit-based watermarks (which embed signal in token probabilities rather than green-lists), evaluate multi-pass paraphrase with different rewriting models, and measure the detection-quality tradeoff curve — how much text quality do you sacrifice to maintain detectability?

Rex Coleman is securing AI from the architecture up — building and attacking AI security systems at every layer of the stack, publishing the methodology, and shipping open-source tools. rexcoleman.dev · GitHub

If this was useful, subscribe on Substack for weekly AI security research — findings, tools, and curated signal.

What We Built#

What We Found#

The Recovery Story: Why v1 Failed and v2 Worked#

Watermark Strength and Text Length#

False Positive Rate#

What the Governance Framework Caught#

Reproducibility#

Related Work#

Limitations#

What’s Next#