LLM Watermark Detection Reproducibility
Under SolvNum, the Kirchenbauer watermark detector’s z-score is bit-identical across Windows and Linux, and detection decisions stay stable on every borderline case in the benchmark.
- 0: SolvNum receipt mismatches
- 2: Float64 receipt mismatches (expected)
- 1,296: Detector calls per implementation
- ≤1 bit: Math-library divergence absorbed
The scenario
Set the picture
Meta and peers have published watermarking schemes for LLM outputs. Every scheme has a detector: a statistical test that computes a z-score and applies a threshold to decide “watermark present” or “watermark absent.”
When the z-score is far from the threshold, drift doesn’t matter. When it’s borderline, platform-dependent arithmetic drift can flip the detection decision — one machine says “AI-generated,” another says “not AI-generated” on the same input.
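To make the borderline case concrete, here is a minimal sketch of the one-proportion z-test from Kirchenbauer et al. (2023), z = (g − γT) / sqrt(Tγ(1 − γ)), where g is the green-token count, T the number of scored tokens, and γ the green-list fraction. The function names and the threshold of 4.0 are illustrative assumptions, not taken from the benchmark suite.

```python
import math

def z_score(green_count: int, total: int, gamma: float) -> float:
    """One-proportion z-test on the green-token count."""
    expected = gamma * total                  # green tokens expected by chance
    variance = total * gamma * (1.0 - gamma)  # binomial variance
    return (green_count - expected) / math.sqrt(variance)

def detect(z: float, threshold: float = 4.0) -> bool:
    """Threshold value is an illustrative assumption."""
    return z > threshold

# 48 green tokens out of 64 at gamma = 0.5 lands exactly on the threshold.
z = z_score(48, 64, 0.5)               # z == 4.0
drifted = math.nextafter(z, math.inf)  # the same score, one ULP higher

print(z, detect(z))              # decision: watermark absent
print(drifted, detect(drifted))  # decision: watermark present
```

A single unit in the last place is enough to change the answer here, which is the failure mode described above.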
Cost today
The published detector papers do not analyze detection-decision stability across platforms. There is a gap in the literature.
EU AI Act Article 50 requires transparency about AI-generated content. If the detector that enforces this obligation produces different answers on different machines, the regulatory artifact is unreliable.
What changes with SolvNum
Both SolvNum-backed detectors, the Kirchenbauer 2023 z-score detector and the Kirchenbauer 2024 weighted detector, produce bit-identical receipt hashes on Windows and Linux.
Float64 weighted detector: different receipt hashes across the two hosts. The per-token multiplication and reduction accumulate rounding differences that surface as receipt mismatches and, on borderline texts, can flip the detection decision.
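The sensitivity to reduction order can be reproduced on a single machine, because float64 addition is not associative. The sketch below sums the same three values in two orders and hashes the raw bytes of each result. The helper names and the receipt format (SHA-256 over the little-endian float64 bytes) are assumptions made for illustration, not the benchmark's actual format.

```python
import hashlib
import struct

def reduce_forward(values):
    """Left-to-right float64 accumulation."""
    total = 0.0
    for v in values:
        total += v
    return total

def reduce_backward(values):
    """Right-to-left float64 accumulation over the same values."""
    total = 0.0
    for v in reversed(values):
        total += v
    return total

def float_receipt(x: float) -> str:
    """Hash the raw IEEE 754 bytes of a float64 (hypothetical receipt format)."""
    return hashlib.sha256(struct.pack("<d", x)).hexdigest()

weights = [0.1, 0.2, 0.3]  # stand-ins for per-token weighted terms
a = reduce_forward(weights)
b = reduce_backward(weights)

print(a == b)                                # False: the sums differ in the last bit
print(float_receipt(a) == float_receipt(b))  # False: so the hashes differ too
```

Any hash taken over float bytes inherits this sensitivity, so two hosts that reduce in a different order, or round an intermediate differently, will not agree.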
The SolvNum detector returns integer triples (sign, q, e) for hashing — never re-converts to float. That’s why the receipt is stable: the hash path never touches libm.
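As a rough illustration of why an integer-only hash path is stable, the sketch below serializes (sign, q, e) triples as ASCII decimal text and hashes the result. The serialization format and the function name `receipt_hash` are hypothetical; the page does not describe SolvNum's actual encoding.

```python
import hashlib

def receipt_hash(triples) -> str:
    """Hash a sequence of (sign, q, e) integer triples.

    Python integers are arbitrary precision and their decimal text form is
    the same on every platform, so no floating-point or math-library call
    sits between the values and the digest.
    """
    h = hashlib.sha256()
    for sign, q, e in triples:
        h.update(f"{sign},{q},{e};".encode("ascii"))
    return h.hexdigest()

# Illustrative values only, not benchmark output.
scores = [(1, 4503599627370496, -50), (-1, 12345, -3)]
print(receipt_hash(scores))
```

The digest depends only on the integer values and their order, which is the property the receipt comparison relies on.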
Measurable outcome
What we claim — and how it survives review
Each line below maps to a captured number in the demo section. Every number is reproducible from the benchmark suite.
- SolvNum k2023 detector: receipt hash match across Windows and Linux.
- SolvNum k2024 weighted detector: receipt hash match across Windows and Linux.
- Float64 k2024 weighted (naive + Kahan): receipt hash MISMATCH across Windows and Linux.
- SolvNum mismatches: 0. Float mismatches: 2 (both expected on cross-platform runs).
- 1,296 detector calls (216 sequences × 3 reduction orders × 2 sqrt paths) per implementation.
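The call count in the last bullet is the size of a three-way cross product, which can be checked directly. The labels for the reduction orders and sqrt paths below are placeholders, since the page does not name them.

```python
from itertools import product

NUM_SEQUENCES = 216
REDUCTION_ORDERS = ("order_a", "order_b", "order_c")  # placeholder labels
SQRT_PATHS = ("sqrt_a", "sqrt_b")                     # placeholder labels

# One detector call per (sequence, reduction order, sqrt path) combination.
calls = list(product(range(NUM_SEQUENCES), REDUCTION_ORDERS, SQRT_PATHS))
print(len(calls))  # 1296
```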
The demo
What was tested. How. What the script printed.
Synthetic corpus: 216 sequences at varying watermark settings (γ = 0.25, 0.5, 0.75, where γ is the green-list fraction in Kirchenbauer's notation) and lengths (50, 200, 1000 tokens). Three implementations: float64_naive, float64_kahan, solvnum_backed. Two detectors: Kirchenbauer 2023 z-score, Kirchenbauer 2024 weighted.
Cross-platform verification: bench.py runs on both hosts and writes one anchor file per host. verify.py then compares the anchor hashes. SolvNum mismatches must be 0; float mismatches are expected and informational.
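The comparison step might look roughly like the sketch below. It assumes each anchor file decodes to a flat mapping from a "detector/implementation" key to a hash string. That schema, the function name, and the placeholder hashes are all assumptions; the real verify.py may differ.

```python
def compare_anchors(anchors_a: dict, anchors_b: dict) -> dict:
    """Split cross-host hash mismatches into SolvNum and float64 groups."""
    solvnum, floats = [], []
    for key in sorted(anchors_a.keys() | anchors_b.keys()):
        if anchors_a.get(key) == anchors_b.get(key):
            continue  # hashes agree on both hosts
        # A key missing on one host also counts as a mismatch.
        (solvnum if "solvnum" in key else floats).append(key)
    return {"solvnum_mismatches": solvnum, "float_mismatches": floats}

# Placeholder hashes, not the captured benchmark values.
host_a = {"k2023/solvnum_backed": "aaaa", "k2024_weighted/float64_naive": "bbbb"}
host_b = {"k2023/solvnum_backed": "aaaa", "k2024_weighted/float64_naive": "cccc"}

result = compare_anchors(host_a, host_b)
print(result)
# Only SolvNum mismatches are treated as failures; float ones are informational.
ok = len(result["solvnum_mismatches"]) == 0
```

In practice the two dictionaries would be loaded from the per-host anchor files with `json.load`.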
Captured benchmark output
The numbers the script actually printed.
| Detector × Implementation | Windows hash | Linux hash | Match |
|---|---|---|---|
| k2023 / solvnum_backed | 9c0a6229… | 9c0a6229… | ✓ MATCH |
| k2024_weighted / solvnum_backed | 27c76c7a… | 27c76c7a… | ✓ MATCH |
| k2024_weighted / float64_naive | ad6e9f35… | 7b9bcf49… | ✗ DIFFER |
| k2024_weighted / float64_kahan | 9870edff… | cd551bf5… | ✗ DIFFER |
The k2023 float64 implementations happen to match on these two x86_64 hosts because their only floating-point work is a square root over integers. This is not a property to rely on.
Evidence pointers
Where the claims live in the repo
These are the files a reviewer should run to re-derive every number on this page.
- tools/solvnum/buyer_pocs/llm_watermark/bench.py
- tools/solvnum/buyer_pocs/llm_watermark/verify.py
- tools/solvnum/buyer_pocs/reports/llm_watermark_anchors_win.json
- tools/solvnum/buyer_pocs/reports/llm_watermark_anchors_wsl.json
- docs/poc/05_llm_watermark.md
- docs/poc/05_llm_watermark_xplat_evidence.md
Want to see these receipts on your pipeline?
Run the benchmark against your actual decision pipeline.
Two weeks, $25K, fully credited. No production integration, no data leaving your premises. Every claim above traces back to a script you can run locally.