Multi-Bit Watermarking

tl;dr Recently, my research group and I have been collaborating with scientists at Amazon to design, implement, and test multi-bit watermarking schemes. We will likely release our pre-print soon. But until then, in this blog post, I present some thoughts on multi-bit text watermarking.

Text watermarking can be introduced as a digital provenance tool: a model provider slightly changes generation so that later, with the right detector, it can tell (preferably with overwhelmingly high confidence) whether some generated text came from its model. That is the zero-bit version: the hidden payload or message is just “watermark present.” The more interesting version is multi-bit watermarking, where the generated text carries a message: a user ID, model ID, timestamp, content-policy tag, licensing tag, or audit trail.

That extra payload is useful, but it creates a fundamental tension. Embedding more bits requires the output distribution to depend more strongly on the hidden message; stronger dependence makes decoding easier, but it also makes the text easier to distinguish from ordinary model output. Our latest work (to be released soon!) states this tradeoff directly: higher payloads tend to increase detectability, while stronger distortion-free requirements reduce achievable rates.

This is not a brand-new conceptual problem. It is the LLM version of a much older question in information hiding, data hiding, steganography, and digital watermarking: how many bits can be hidden in a host object while preserving some notion of fidelity, stealth, or robustness? Moulin and O’Sullivan’s information-theoretic treatment describes information hiding as hiding information in a host data set so that it can be reliably communicated to a receiver, covering watermarking, fingerprinting, steganography, and data embedding [1]. Cachin’s information-theoretic steganography model casts the adversary’s task as a hypothesis test between innocent cover messages and stego messages, with security quantified through distributional divergence [2]. Chen and Wornell’s quantization-index modulation work studies embedding a signal, such as a digital watermark, inside a host signal to form a composite signal, with provable rate-distortion-robustness behavior [3].

The LLM twist is that the “host” is not a fixed image, audio clip, or document. It is a next-token distribution. At each generation step, the base model gives a distribution Q_t over the vocabulary, and the watermarker chooses a nearby distribution Q_t^\star from which it samples. This turns multi-bit text watermarking into a channel coding problem with a distributional constraint.

From zero-bit detection to multi-bit communication

A zero-bit watermark asks:

H_0: X_{1:n} \sim \text{ordinary model} \quad \text{versus} \quad H_1: X_{1:n} \sim \text{watermarked model}.

A multi-bit watermark asks a stronger question:

M \in \{0,1\}^k \quad \longrightarrow \quad X_{1:n} \quad \longrightarrow \quad \widehat M,

where M is the hidden message, X_{1:n} is the generated text, and \widehat M is the decoded message. The receiver may know a secret key, a public codebook, and perhaps the base model. The adversary may try to detect, remove, forge, or corrupt the watermark.

Early LLM watermarking work focused largely on zero-bit detection. Kirchenbauer et al.’s ICML watermark, for example, selects a randomized set of “green” tokens before each token is generated and softly promotes them during sampling, giving an efficient statistical detector [4]. SynthID-Text is another prominent watermarking system for LLM outputs, described as preserving text quality while enabling efficient detection with minimal latency overhead [5]. Multi-bit schemes move beyond detection into payload extraction; for example, multi-bit text watermarking methods have been proposed for traceability, robust extraction, and paraphrase-resilient embedding.
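To make the green-token idea concrete, here is a minimal toy sketch of a zero-bit watermark in that style. It is not the reference implementation from [4]: the function names, the SHA-256 seeding on the previous token, the flat-logit "model", and the parameters gamma (green fraction) and delta (logit boost) are all assumptions chosen for illustration.

```python
import hashlib
import math
import random

def green_list(prev_token, vocab_size, key, gamma=0.5):
    """Keyed pseudo-random subset holding a gamma-fraction of the vocabulary."""
    digest = hashlib.sha256(key + prev_token.to_bytes(4, "big")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return set(rng.sample(range(vocab_size), int(gamma * vocab_size)))

def promote(logits, green, delta=2.0):
    """Add delta to green-token logits, then apply a softmax."""
    boosted = [l + delta if i in green else l for i, l in enumerate(logits)]
    top = max(boosted)
    weights = [math.exp(b - top) for b in boosted]
    total = sum(weights)
    return [w / total for w in weights]

def generate(n, vocab_size, key, delta, rng):
    """Toy generator (assumption): the 'model' has flat logits at every step."""
    tokens = [0]
    for _ in range(n):
        green = green_list(tokens[-1], vocab_size, key)
        probs = promote([0.0] * vocab_size, green, delta)
        tokens.append(rng.choices(range(vocab_size), weights=probs)[0])
    return tokens

def z_score(tokens, vocab_size, key, gamma=0.5):
    """One-proportion z-test on the count of green tokens."""
    n = len(tokens) - 1
    hits = sum(
        tokens[t] in green_list(tokens[t - 1], vocab_size, key, gamma)
        for t in range(1, len(tokens))
    )
    return (hits - gamma * n) / math.sqrt(n * gamma * (1 - gamma))

rng = random.Random(0)
marked = generate(200, 50, b"secret", delta=2.0, rng=rng)
plain = generate(200, 50, b"secret", delta=0.0, rng=rng)
print("watermarked z:", z_score(marked, 50, b"secret"))
print("ordinary z:   ", z_score(plain, 50, b"secret"))
```

The detector only answers "present or not": the z-score carries no payload, which is exactly the gap multi-bit schemes try to fill.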

Our work frames this shift using an information-theoretic channel view: reliable message recovery is governed by the conditional mutual information of the induced watermark channel, and different distortion regimes create different capacity-detectability frontiers.

The basic channel model

Let \mathcal X be the vocabulary. At time t, the base LLM gives

Q_t \in \Delta(\mathcal X),

a next-token distribution conditioned on the previous context. A multi-bit watermark has:

M \in \{0,1\}^k

as the message, and a secret key or side information S_t. The encoder maps (M,S_t) into a state

Z_t = f_t(M,S_t),

then samples

X_t \sim P_{X_t|Z_t=z}.

The key design constraint is that each conditional law P_{X_t|Z_t=z} should remain close to the base law Q_t. To quantify closeness, one could use total variation distance: d_{\mathrm{TV}}(P,Q) = \frac12 \sum_{x \in \mathcal X} |P(x)-Q(x)|. A watermarked distribution is called uniformly per-key \varepsilon-distortion-free when, for every time step t, every message, and every realized key, the conditional token distribution stays within \varepsilon of the base distribution Q_t in total variation.
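The definition above can be checked numerically at a single time step. The sketch below assumes the conditional laws are available as explicit probability vectors indexed by the state z; the helper names and the small tolerance argument are mine, not part of any released code.

```python
def tv_distance(p, q):
    """Total variation distance between two distributions on the same alphabet."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def is_uniformly_distortion_free(conditionals, base, eps, tol=1e-12):
    """True if every conditional law P_{X_t|Z_t=z} is within eps of the base law.

    `conditionals` maps each state z to a probability vector. The check is a
    worst case over states, matching the "every message and realized key" clause.
    """
    return all(tv_distance(p, base) <= eps + tol for p in conditionals.values())

base = [0.5, 0.3, 0.2]                       # Q_t on a three-token vocabulary
conditionals = {0: [0.45, 0.35, 0.2],        # P_{X_t|Z_t=0}
                1: [0.55, 0.25, 0.2]}        # P_{X_t|Z_t=1}

print(is_uniformly_distortion_free(conditionals, base, eps=0.05))
print(is_uniformly_distortion_free(conditionals, base, eps=0.04))
```

Both conditional laws move 0.05 of probability mass, so the check passes at eps = 0.05 and fails at eps = 0.04.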

This definition matters because total variation has an operational meaning: it is exactly the largest possible distinguishing advantage of any detector trying to decide whether a sample came from P or Q.

Total variation is maximum distinguishing advantage

Let P and Q be distributions on a finite alphabet \mathcal X. A detector is a function \phi:\mathcal X \to \{0,1\}. Its distinguishing advantage is \Pr_{X\sim P}[\phi(X)=1] - \Pr_{X\sim Q}[\phi(X)=1].

For a fixed detector, \Pr_P[\phi(X)=1] = \mathbb E_P[\phi(X)] because \phi takes values in \{0,1\}, so

\mathbb E_P[\phi(X)]-\mathbb E_Q[\phi(X)] = \sum_x \phi(x)(P(x)-Q(x)).

Let

A = \{x : P(x) \ge Q(x)\}.

Since 0 \le \phi(x) \le 1, \sum_x \phi(x)(P(x)-Q(x)) \le \sum_{x\in A} (P(x)-Q(x)).

But \sum_{x\in A} (P(x)-Q(x)) = \frac12 \sum_x |P(x)-Q(x)| = d_{\mathrm{TV}}(P,Q).

The upper bound is achieved by the detector \phi^\star(x)=1\{x\in A\}. Exchanging the roles of P and Q gives the same bound for detectors that fire more often under Q than under P, which is why the statement is taken in absolute value. Therefore, \sup_\phi \left| \Pr_{P}[\phi(X)=1]-\Pr_Q[\phi(X)=1]\right| = d_{\mathrm{TV}}(P,Q).
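The argument can be sanity-checked on a small alphabet by building the set A directly and comparing its advantage with a brute-force search over every deterministic detector. This is only an illustrative check with made-up distributions; the function names are hypothetical.

```python
from itertools import combinations

def tv_distance(p, q):
    """Total variation distance between two distributions on the same alphabet."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def optimal_detector(p, q):
    """Return A = {x : P(x) >= Q(x)} and its advantage P(A) - Q(A)."""
    accept = {x for x in range(len(p)) if p[x] >= q[x]}
    advantage = sum(p[x] - q[x] for x in accept)
    return accept, advantage

def best_advantage_brute_force(p, q):
    """Maximize |P(B) - Q(B)| over every subset B (every deterministic detector)."""
    best = 0.0
    for size in range(len(p) + 1):
        for subset in combinations(range(len(p)), size):
            best = max(best, abs(sum(p[x] - q[x] for x in subset)))
    return best

P = [0.55, 0.25, 0.2]
Q = [0.5, 0.3, 0.2]
accept, advantage = optimal_detector(P, Q)
print(accept, advantage, best_advantage_brute_force(P, Q), tv_distance(P, Q))
```

All three quantities agree up to floating-point error. Randomized detectors with values in [0,1] cannot do better, since the advantage is linear in \phi and is therefore maximized at a deterministic one.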

So a per-token TV budget is a bound on the best possible one-token test. In our work, we have been using the TV budget to implement watermarking schemes and test the detectability and message-recoverability of such schemes.
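As one illustration of how a TV budget can be spent, the sketch below embeds a single bit by moving at most eps of probability mass toward or away from a keyed "green" subset of the vocabulary. This is a toy construction for intuition, not the scheme from our pre-print; `tilt`, `decode_bit`, and the green-set representation are assumptions, and the decoder assumes the receiver can recompute the base distributions.

```python
def tilt(q, green, bit, eps):
    """Shift up to eps of mass into (bit=1) or out of (bit=0) the green set.

    Within each side the base probabilities are rescaled proportionally, so
    the result P satisfies d_TV(P, q) = shift <= eps.
    """
    g = sum(q[i] for i in green)
    if g <= 0.0 or g >= 1.0:
        return list(q)  # one side is empty: nothing can be moved
    shift = min(eps, 1.0 - g) if bit else min(eps, g)
    new_g = g + shift if bit else g - shift
    return [
        q[i] * new_g / g if i in green else q[i] * (1.0 - new_g) / (1.0 - g)
        for i in range(len(q))
    ]

def decode_bit(tokens, greens, base_dists):
    """Guess the bit by comparing green hits with their base-model expectation."""
    hits = sum(tok in green for tok, green in zip(tokens, greens))
    expected = sum(sum(q[i] for i in green) for q, green in zip(base_dists, greens))
    return 1 if hits > expected else 0

q = [0.4, 0.3, 0.2, 0.1]
green = {0, 3}                      # base green mass is 0.5
p1 = tilt(q, green, bit=1, eps=0.1)
p0 = tilt(q, green, bit=0, eps=0.1)
print(p1, p0)
```

With these numbers each tilted law sits exactly 0.1 away from q in total variation. A single position reveals very little about the bit, so in practice the same bit would have to be repeated or coded across many positions.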

How this relates to information hiding, data hiding, and steganography

The terminology across communities is inconsistent, but the underlying mathematical formulation is (fairly) stable.

Information hiding is the broad umbrella. There is a host object, a hidden message, a distortion or detectability constraint, and a receiver. The goal is reliable communication through the host.

Data hiding often emphasizes embedding payload bits into media. The fidelity constraint may be perceptual: image distortion, audio quality, semantic preservation, or edit distance.

Digital watermarking often emphasizes provenance, ownership, authentication, or tracing. The hidden message may be a copyright mark, model identifier, user identifier, or policy tag. Robustness against attacks is often central.

Steganography emphasizes secrecy of the very existence of communication. The adversary’s detection problem is primary. Cachin’s model, with a passive adversary distinguishing cover from stego distributions, is especially close to modern distributional formulations of LLM watermark stealth.

Multi-bit LLM watermarking sits at the intersection. It is data hiding because it embeds a payload. It is watermarking because the payload is usually provenance or attribution metadata. It is steganography when the watermarked output must be statistically indistinguishable from ordinary model output. It is channel coding because reliable recovery is governed by mutual information and error-correcting codes.

Our work explicitly connects LLM watermarking to the information-theoretic literature on watermarking, data hiding, and steganography, noting that classical work models watermarking as communication over a constrained channel and that the LLM setting replaces perceptual host distortion with distributional constraints on the base next-token law.

The central frontier

In my opinion, a good multi-bit watermark should satisfy at least three of these:

  1. Payload: many bits per token.
  2. Reliability: low message error after generation and possible edits.
  3. Distortion-freeness/Stealth/Low-detectability: small statistical distance from ordinary model output.
  4. Utility: low degradation of fluency, factuality, reasoning, and user experience.

Core results in information theory imply that these cannot all be maximized simultaneously. Stronger distortion-free constraints reduce mutual information. More robustness requires redundancy. More redundancy lowers rate. More aggressive perturbations improve decoding but increase detectability and may degrade quality.
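The first of these tradeoffs can be put into numbers for a deliberately simple toy channel. Assume a uniform bit B, a token subset with base mass g, and a watermark that moves the subset's mass to g + eps when B = 1 and g - eps when B = 0, so each conditional law stays within eps of the base law in total variation. The mutual information between B and the subset indicator is then h(g) - (h(g + eps) + h(g - eps)) / 2, where h is the binary entropy. This describes one toy scheme and is not a capacity result; the function names are hypothetical.

```python
import math

def binary_entropy(p):
    """Binary entropy h(p) in bits, with h(0) = h(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bits_per_token(g, eps):
    """I(B; Y) for uniform B and Y = 1{token in subset}, P(Y=1|B) = g +/- eps."""
    if not (eps <= g <= 1 - eps):
        raise ValueError("need eps <= g <= 1 - eps so both tilts are valid")
    return binary_entropy(g) - 0.5 * (binary_entropy(g + eps) + binary_entropy(g - eps))

for eps in (0.01, 0.05, 0.1, 0.2):
    print(eps, bits_per_token(0.5, eps))
```

At g = 0.5 and eps = 0.1 this comes to about 0.03 bits per token, so even before adding redundancy for robustness, this toy scheme needs on the order of 35 tokens per payload bit. Halving eps cuts the rate by roughly a factor of four.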

The right question is not “Can we make a perfect multi-bit watermark?” It is: What is the best achievable rate at a given detectability, robustness, and quality budget?

That is a capacity question!

References

[1] Pierre Moulin and Joseph A. O’Sullivan. Information-theoretic analysis of information hiding. IEEE Trans. Inf. Theory, 49(3):563–593, 2003.

[2] Christian Cachin. An information-theoretic model for steganography. Information and Computation, 192(1):41–56, 2004.

[3] Brian Chen and Gregory W. Wornell. Quantization index modulation: a class of provably good methods for digital watermarking and information embedding. IEEE Trans. Inf. Theory, 47(4):1423–1443, 2001.

[4] John Kirchenbauer, Jonas Geiping, Yuxin Wen, Jonathan Katz, Ian Miers, and Tom Goldstein. A watermark for large language models. In Proceedings of the 40th International Conference on Machine Learning, 2023.

[5] Sumanth Dathathri, Abigail See, Sumedh Ghaisas, Po-Sen Huang, Rob McAdam, Johannes Welbl, Vandana Bachani, Alex Kaskasoli, Robert Stanforth, Tatiana Matejovicova, Jamie Hayes, Nidhi Vyas, Majd Merey, Jonah Brown-Cohen, Rudy Bunel, Borja Balle, Ali Cemgil, Zahra Ahmed, Kitty Stacpoole, and Pushmeet Kohli. Scalable watermarking for identifying large language model outputs. Nature, 634:818–823, 2024.
