Watermarking in LLMs

1. WaterPool: A Watermark Mitigating Trade-offs among Imperceptibility, Efficacy and Robustness

With the increasing use of large language models (LLMs) in daily life, concerns have emerged regarding their potential misuse and societal impact. Watermarking has been proposed to trace the usage of specific models by injecting patterns into their generated texts. An ideal watermark should produce outputs that are nearly indistinguishable from those of the original LLM (imperceptibility), while ensuring a high detection rate (efficacy), even when the text is partially altered (robustness). Although many methods have been proposed, none achieves all three properties simultaneously, revealing an inherent trade-off. This paper uses a key-centered scheme to unify existing watermarking techniques, decomposing a watermark into two distinct modules: a key module and a mark module. Through this decomposition, we demonstrate for the first time that the key module contributes significantly to the trade-offs observed in prior methods; specifically, it embodies a conflict between the size of the key sampling space during generation and the complexity of key restoration during detection. To address this, we introduce WaterPool, a simple yet effective key module that preserves the complete key sampling space required for imperceptibility while using semantics-based search to improve key restoration. WaterPool can integrate with most watermarks as a plug-in. Experiments with three well-known watermarking techniques show that WaterPool significantly enhances their performance, achieving near-optimal imperceptibility and markedly improving efficacy and robustness (+12.73% for KGW, +20.27% for EXP, +7.27% for ITS).
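
As a concrete reading of this decomposition, the sketch below implements a toy key module: generation samples keys from the full 64-bit space, while detection retrieves a handful of candidate keys by semantic search over stored contexts. The bag-of-words embedding and all class and method names are our illustrative stand-ins, not WaterPool's implementation:

```python
import math
import random
from collections import Counter

def embed(text: str) -> Counter:
    """Stand-in 'semantic' embedding: a bag-of-words vector."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class WaterPoolKeyModule:
    """Samples keys from the full key space at generation time
    (imperceptibility) and narrows key restoration at detection
    time via semantic search (efficacy and robustness)."""

    def __init__(self):
        self.pool = []  # (context embedding, key) pairs

    def sample_key(self, context: str) -> int:
        key = random.getrandbits(64)           # full sampling space
        self.pool.append((embed(context), key))
        return key

    def restore_keys(self, text: str, top_k: int = 5) -> list:
        """Return top-k candidate keys instead of scanning every key."""
        ranked = sorted(self.pool, key=lambda e: -cosine(e[0], embed(text)))
        return [key for _, key in ranked[:top_k]]
```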

2. Enhancing Watermarked Language Models to Identify Users

A zero-bit watermarked language model produces text that is indistinguishable from that of the underlying model but can be detected as machine-generated using a secret key. Merely detecting AI-generated spam as watermarked, however, may not prevent future abuse: if we could additionally trace the text to a spammer's API token, we could cut off their access to the model.
We introduce multi-user watermarks, which allow tracing model-generated text to individuals or to groups of colluding users. We construct multi-user watermarking schemes from undetectable zero-bit watermarking schemes. Importantly, our schemes provide both zero-bit and multi-user assurances at the same time: detecting shorter snippets as well as the original scheme and tracing longer excerpts to individuals. Along the way, we give a generic construction of a watermarking scheme that embeds long messages into generated text.
Ours are the first black-box reductions between watermarking schemes for language models. A major challenge for black-box reductions is the lack of a unified abstraction for robustness, i.e., that marked text remains detectable after edits. Existing works give incomparable robustness guarantees, based on bespoke requirements on the language model's outputs and the users' edits. We introduce a new abstraction, AEB-robustness, to overcome this challenge. AEB-robustness provides that the watermark is detectable whenever the edited text "approximates enough blocks" of model-generated output; specifying the robustness condition amounts to defining approximates, enough, and blocks. Using this abstraction, we relate the robustness properties of our constructions to those of the underlying zero-bit scheme. Whereas prior works only guarantee robustness for a single text generated in response to a single prompt, our schemes are robust against adaptive prompting, a stronger adversarial model.

3. MarkLLM: An Open-Source Toolkit for LLM Watermarking

LLM watermarking, which embeds imperceptible yet algorithmically detectable signals in model outputs to identify LLM-generated text, has become crucial in mitigating the potential misuse of large language models. However, the abundance of LLM watermarking algorithms, their intricate mechanisms, and the complex evaluation procedures and perspectives pose challenges for researchers and the community to easily experiment with, understand, and assess the latest advancements. To address these issues, we introduce MarkLLM, an open-source toolkit for LLM watermarking. MarkLLM offers a unified and extensible framework for implementing LLM watermarking algorithms, while providing user-friendly interfaces to ensure ease of access. Furthermore, it enhances understanding by supporting automatic visualization of the underlying mechanisms of these algorithms. For evaluation, MarkLLM offers a comprehensive suite of 12 tools spanning three perspectives, along with two types of automated evaluation pipelines. Through MarkLLM, we aim to support researchers while improving the comprehension and involvement of the general public in LLM watermarking technology, fostering consensus and driving further advancements in research and application. Our code is available at this https URL.

4. Stylometric Watermarks for Large Language Models

The rapid advancement of large language models (LLMs) has made it increasingly difficult to distinguish between text written by humans and by machines. Addressing this, we propose a novel method for generating watermarks that strategically alters token probabilities during generation. Unlike previous works, this method uniquely employs linguistic features such as stylometry; concretely, we introduce acrostica and sensorimotor norms to LLMs. These features are parameterized by a key, which is updated every sentence. To compute this key, we use semantic zero-shot classification, which enhances resilience. In our evaluation, we find that for three or more sentences, our method achieves false positive and false negative rates of 0.02. For a cyclic translation attack, we observe similar results for seven or more sentences. This research is of particular interest for proprietary LLMs, to facilitate accountability and prevent societal harm.
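
As a toy illustration of the acrostic feature, the sketch below softly biases each sentence's opening word toward a key-derived letter; the letter schedule and boost factor are our assumptions, and the actual method alters token probabilities inside the LLM rather than reranking whole words:

```python
import random

def key_letter(key: int, sent_idx: int) -> str:
    """Key-derived acrostic letter for sentence `sent_idx`."""
    return "etaoinshrdlu"[(key + sent_idx) % 12]

def choose_opener(candidates, key: int, sent_idx: int, boost: float = 4.0) -> str:
    """Softly bias the sentence-initial word toward the acrostic letter."""
    target = key_letter(key, sent_idx)
    weights = [boost if w[0].lower() == target else 1.0 for w in candidates]
    return random.choices(candidates, weights=weights)[0]

# key=1, sent_idx=0 -> target letter 't', so 'Therefore' is favored
print(choose_opener(["Therefore", "However", "Also"], key=1, sent_idx=0))
```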

5. Explanation as a Watermark: Towards Harmless and Multi-bit Model Ownership Verification via Watermarking Feature Attribution

Ownership verification is currently the most critical and widely adopted post-hoc method to safeguard model copyright. In general, model owners exploit it to identify whether a given suspicious third-party model is stolen from them by examining whether it has particular properties 'inherited' from their released models. Currently, backdoor-based model watermarks are the primary and cutting-edge methods to implant such properties in the released models. However, backdoor-based methods have two fatal drawbacks: harmfulness and ambiguity. The former indicates that they introduce maliciously controllable misclassification behaviors (i.e., backdoors) into the watermarked released models. The latter denotes that malicious users can easily pass the verification by finding other misclassified samples, leading to ownership ambiguity.
In this paper, we argue that both limitations stem from the 'zero-bit' nature of existing watermarking schemes, which exploit the status (i.e., misclassified) of predictions for verification. Motivated by this understanding, we design a new watermarking paradigm, Explanation as a Watermark (EaaW), that implants verification behaviors into the explanation given by feature attribution instead of into model predictions. Specifically, EaaW embeds a 'multi-bit' watermark into the feature-attribution explanation of specific trigger samples without changing the original prediction. We correspondingly design watermark embedding and extraction algorithms inspired by explainable artificial intelligence. In particular, our approach can be used for different tasks (e.g., image classification and text generation). Extensive experiments verify the effectiveness and harmlessness of EaaW and its resistance to potential attacks.
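
A minimal sketch of the extraction side of this paradigm, assuming a stand-in linear attribution (w_i * x_i): the multi-bit message lives in the signs of the trigger sample's attribution vector while the prediction itself is untouched. EaaW's real embedding step fine-tunes the model so that these signs match the owner's bit string:

```python
import numpy as np

rng = np.random.default_rng(0)

def attribution(weights: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Stand-in feature attribution: per-feature contribution w_i * x_i."""
    return weights * x

def extract_watermark(weights, trigger, n_bits: int) -> np.ndarray:
    """Read the multi-bit mark from attribution signs on the trigger sample."""
    return (attribution(weights, trigger)[:n_bits] > 0).astype(int)

weights = rng.normal(size=16)   # model parameters after watermark embedding
trigger = rng.normal(size=16)   # secret trigger sample
print(extract_watermark(weights, trigger, n_bits=8))
```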

6. WateRF: Robust Watermarks in Radiance Fields for Protection of Copyrights

Advances in Neural Radiance Fields (NeRF) research offer extensive applications in diverse domains, but protecting NeRF copyrights has not yet been researched in depth. Recently, NeRF watermarking has been considered one of the pivotal solutions for safely deploying NeRF-based 3D representations. However, existing methods apply only to implicit or only to explicit NeRF representations. In this work, we introduce an innovative watermarking method that can be employed with both representations. This is achieved by fine-tuning NeRF to embed binary messages in the rendering process. In detail, we propose utilizing the discrete wavelet transform in the NeRF space for watermarking. Furthermore, we adopt a deferred back-propagation technique combined with a patch-wise loss to improve rendering quality and bit accuracy with minimal trade-offs. We evaluate our method in three different aspects: capacity, invisibility, and robustness of the embedded watermarks in the 2D-rendered images. Our method achieves state-of-the-art performance with faster training than the compared state-of-the-art methods.
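
To make the transform-domain idea concrete, here is a hedged sketch that nudges low-frequency DWT coefficients of a single rendered frame to carry bits. WateRF embeds the message by fine-tuning the NeRF itself rather than by editing frames, so treat this purely as an illustration of where the signal lives:

```python
import numpy as np
import pywt

def embed_bits(frame: np.ndarray, bits, strength: float = 2.0) -> np.ndarray:
    """Push each bit into one low-frequency (LL) wavelet coefficient."""
    ll, (lh, hl, hh) = pywt.dwt2(frame, "haar")   # one-level 2D DWT
    for i, b in enumerate(bits):
        ll.flat[i] += strength if b else -strength
    return pywt.idwt2((ll, (lh, hl, hh)), "haar")

frame = np.random.rand(64, 64)          # stand-in rendered image
marked = embed_bits(frame, [1, 0, 1, 1])
print(np.abs(marked - frame).max())     # small, localized change
```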

7. Are Watermarks Bugs for Deepfake Detectors? Rethinking Proactive Forensics

AI-generated content has accelerated the topic of media synthesis, particularly Deepfakes, which can manipulate our portraits for positive or malicious purposes. Before these threatening face images are released, one promising forensic solution is to inject robust watermarks to track their provenance. However, we argue that current watermarking models, originally devised for genuine images, may harm deployed Deepfake detectors when directly applied to forged images, since the watermarks are prone to overlap with the forgery signals used for detection. To bridge this gap, we propose AdvMark, on behalf of proactive forensics, to exploit the adversarial vulnerability of passive detectors for good. Specifically, AdvMark serves as a plug-and-play procedure for fine-tuning any robust watermarking into adversarial watermarking, enhancing the forensic detectability of watermarked images; meanwhile, the watermarks can still be extracted for provenance tracking. Extensive experiments demonstrate the effectiveness of the proposed AdvMark, which leverages robust watermarking to fool Deepfake detectors and can thereby improve the accuracy of downstream Deepfake detection without tuning in-the-wild detectors. We believe this work will shed some light on harmless proactive forensics against Deepfakes.

8. CodeIP: A Grammar-Guided Multi-Bit Watermark for Large Language Models of Code

As Large Language Models (LLMs) are increasingly used to automate code generation, it is often desirable to know whether code is AI-generated and, if so, by which model, especially for purposes such as protecting intellectual property (IP) in industry and preventing academic misconduct in education. Incorporating watermarks into machine-generated content is one way to provide code provenance, but existing solutions are restricted to a single bit or lack flexibility. We present CodeIP, a new watermarking technique for LLM-based code generation. CodeIP enables the insertion of multi-bit information while preserving the semantics of the generated code, improving the strength and diversity of the inserted watermark. This is achieved by training a type predictor to predict the grammar type of the next token, enhancing the syntactic and semantic correctness of the generated code. Experiments on a real-world dataset across five programming languages showcase the effectiveness of CodeIP.
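
A minimal sketch of grammar-guided, watermark-biased token selection is given below; the tiny type table, the type predictor, and the keyed bias are all illustrative stand-ins rather than CodeIP's components:

```python
import hashlib
import random

TOKEN_TYPE = {"def": "keyword", "return": "keyword",
              "foo": "identifier", "x": "identifier",
              "(": "punct", ")": "punct", ":": "punct"}

def predict_next_type(prefix) -> str:
    """Stand-in type predictor: after 'def' an identifier must follow."""
    return "identifier" if prefix and prefix[-1] == "def" else "keyword"

def mark_bias(token: str, key: int = 7) -> float:
    """Keyed watermark bias (deterministic via sha256)."""
    h = hashlib.sha256(f"{key}:{token}".encode()).digest()
    return 1.5 if h[0] % 2 else 1.0

def next_token(prefix, probs) -> str:
    """Sample among tokens of the predicted grammar type, with watermark bias."""
    ttype = predict_next_type(prefix)
    cands = [t for t in probs if TOKEN_TYPE.get(t) == ttype]
    weights = [probs[t] * mark_bias(t) for t in cands]
    return random.choices(cands, weights=weights)[0]

print(next_token(["def"], {"foo": 0.5, "x": 0.3, "return": 0.2}))
```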

9. Deep Learning-based Text-in-Image Watermarking

In this work, we introduce a novel deep learning-based approach to text-in-image watermarking, which embeds and extracts textual information within images to enhance data security and integrity. Leveraging the capabilities of deep learning, specifically Transformer-based architectures for text processing and Vision Transformers for image feature extraction, our method sets new benchmarks in the domain. It is the first application of deep learning to text-in-image watermarking that improves adaptivity, allowing the model to adjust intelligently to specific image characteristics and emerging threats. Through testing and evaluation, our method has demonstrated superior robustness compared to traditional watermarking techniques, achieving enhanced imperceptibility that ensures the watermark remains undetectable across various image contents.

10. Topic-based Watermarks for LLM-Generated Text

Recent advancements in large language models (LLMs) have resulted in text outputs nearly indistinguishable from human-generated text. Watermarking algorithms offer a way to differentiate between LLM- and human-generated text by embedding detectable signatures within LLM-generated output. However, current watermarking schemes lack robustness against known attacks on watermarking algorithms. In addition, they are impractical given that an LLM generates tens of thousands of text outputs per day, each of which the watermarking algorithm would need to memorize for detection to work. In this work, focusing on the limitations of current watermarking schemes, we propose the concept of a "topic-based watermarking algorithm" for LLMs. The proposed algorithm determines how to generate tokens for the watermarked LLM output based on topics extracted from the input prompt or from the output of a non-watermarked LLM. Inspired by previous work, we propose using a pair of lists, generated from the specified extracted topic(s), that specify tokens to be included or excluded while generating the watermarked output. Using the proposed watermarking algorithm, we show the practicality of a watermark detection algorithm. Furthermore, we discuss a wide range of attacks that can emerge against watermarking algorithms for LLMs and the benefit of the proposed scheme for modeling a potential attacker's gains and losses.
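
A minimal sketch of the topic-seeded include/exclude lists, assuming a SHA-256 hash of the topic as the seed and a 50/50 split (both our illustrative choices):

```python
import hashlib
import random

def topic_lists(topic: str, vocab_size: int, green_fraction: float = 0.5):
    """Derive the include ('green') and exclude ('red') token-id lists from a topic."""
    seed = int.from_bytes(hashlib.sha256(topic.encode()).digest()[:8], "big")
    ids = list(range(vocab_size))
    random.Random(seed).shuffle(ids)
    cut = int(green_fraction * vocab_size)
    return ids[:cut], ids[cut:]

green, red = topic_lists("medicine", vocab_size=50_000)
print(len(green), len(red))   # identical lists anywhere, given the same topic
```

Because both parties can re-derive the lists from the extracted topic alone, detection does not require memorizing each generated output.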

11. Bypassing LLM Watermarks with Color-Aware Substitutions

Watermarking approaches have been proposed to identify whether circulating text is human- or large language model (LLM)-generated. The state-of-the-art watermarking strategy of Kirchenbauer et al. (2023a) biases the LLM toward generating specific ("green") tokens. However, determining the robustness of this watermarking method is an open problem: existing attack methods fail to evade detection for longer text segments. We overcome this limitation and propose Self Color Testing-based Substitution (SCTS), the first "color-aware" attack. SCTS obtains color information by strategically prompting the watermarked LLM and comparing output token frequencies. It uses this information to determine token colors and substitutes green tokens with non-green ones. In our experiments, SCTS successfully evades watermark detection using fewer edits than related work. Additionally, we show both theoretically and empirically that SCTS can remove the watermark from arbitrarily long watermarked text.
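
The following toy reproduces the core frequency test with a stand-in "watermarked LLM" that secretly boosts a green set; the real attack's prompting strategy and statistics are more involved:

```python
import random

HIDDEN_GREEN = {"cat"}   # known to the stand-in model only, not the attacker

def watermarked_lm(pair):
    """Stand-in watermarked LLM: softly boosts green tokens."""
    weights = [3.0 if t in HIDDEN_GREEN else 1.0 for t in pair]
    return random.choices(pair, weights=weights)[0]

def infer_green(pair, trials: int = 2000) -> str:
    """Self color testing: the over-represented token is likely green."""
    counts = {t: 0 for t in pair}
    for _ in range(trials):
        counts[watermarked_lm(pair)] += 1
    return max(counts, key=counts.get)

print(infer_green(("cat", "dog")))   # 'cat' with high probability
```

Once token colors are inferred this way, replacing green tokens with non-green substitutes drives the detector's green-token count toward chance, regardless of text length.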

12. An Entropy-based Text Watermarking Detection Method

Text watermarking algorithms for large language models (LLMs) can embed hidden features into LLM-generated texts to facilitate subsequent detection, alleviating the problem of LLM misuse. Although current text watermarking algorithms perform well in most high-entropy scenarios, their performance in low-entropy scenarios still needs improvement. In this work, we propose that the influence of token entropy should be fully considered during watermark detection: the weight of each token should be adjusted according to its entropy, rather than setting all token weights to the same value as in previous methods. Specifically, we propose an Entropy-based Watermark Detection (EWD) method that gives higher-entropy tokens larger influence weights during detection, so as to better reflect the degree of watermarking. The proposed detection process is training-free and fully automated. Experiments show that our method achieves better detection performance in low-entropy scenarios and is general enough to apply to texts with different entropy distributions. Our code and data will be available online.
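
A hedged sketch of the entropy-weighted score as we read it (the entropy computation and green-list interface are simplified):

```python
import math

def entropy(dist) -> float:
    """Shannon entropy of one next-token distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0)

def ewd_score(tokens, dists, is_green) -> float:
    """Entropy-weighted green rate: high-entropy tokens dominate the score,
    low-entropy (near-forced) tokens barely count."""
    num = den = 0.0
    for tok, dist in zip(tokens, dists):
        w = entropy(dist)
        num += w * float(is_green(tok))
        den += w
    return num / den if den else 0.0

# A near-forced green token contributes almost nothing to the score.
dists = [[0.999, 0.001], [0.5, 0.5]]               # step 0 forced, step 1 open
print(ewd_score([0, 1], dists, lambda t: t == 0))  # ~0.01, not 0.5
```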

13. Duwak: Dual Watermarks in Large Language Models

As large language models (LLMs) are increasingly used for text generation tasks, it is critical to audit their usage, govern their applications, and mitigate their potential harms. Existing watermark techniques have proven effective at embedding a single human-imperceptible, machine-detectable pattern without significantly affecting generated text quality and semantics. However, the efficiency of detecting watermarks, i.e., the minimum number of tokens required to assert detection with significance, and the robustness to post-editing remain open questions. In this paper, we propose Duwak, which fundamentally enhances the efficiency and quality of watermarking by embedding dual secret patterns in both the token probability distribution and the sampling scheme. To mitigate the expression degradation caused by biasing toward certain tokens, we design a contrastive search to watermark the sampling scheme, which minimizes token repetition and enhances diversity. We theoretically explain the interdependency of the two watermarks within Duwak. We evaluate Duwak extensively on Llama2 under various post-editing attacks, against four state-of-the-art watermarking techniques and combinations thereof. Our results show that Duwak-marked text achieves the highest watermarked-text quality at the lowest token count required for detection, up to 70% fewer tokens than existing approaches, especially after paraphrasing.
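
A simplified sketch of the sampling-side watermark as we read it: a contrastive-search step that picks, within a watermark-permitted candidate set, the token balancing model confidence against similarity to the recent context. The embeddings and weighting below are illustrative:

```python
import math

def cos(u, v) -> float:
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return sum(a * b for a, b in zip(u, v)) / (nu * nv) if nu and nv else 0.0

def contrastive_step(candidates, prob, emb, context_embs, alpha=0.6):
    """Among watermark-permitted candidates, trade model probability
    against similarity to context (a repetition penalty)."""
    def degeneration(t):
        return max((cos(emb[t], c) for c in context_embs), default=0.0)
    return max(candidates,
               key=lambda t: alpha * prob[t] - (1 - alpha) * degeneration(t))

prob = {"the": 0.5, "a": 0.3, "its": 0.2}
emb = {"the": [1.0, 0.0], "a": [0.9, 0.1], "its": [0.0, 1.0]}
# 'its' wins despite lower probability: it repeats the context least.
print(contrastive_step(["the", "a", "its"], prob, emb, [[1.0, 0.0]]))
```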

14. Optimizing watermarks for large language models

With the rise of large language models (LLMs) and concerns about their potential misuse, watermarks for generative LLMs have recently attracted much attention. An important aspect of such watermarks is the trade-off between their identifiability and their impact on the quality of the generated text. This paper introduces a systematic approach to this trade-off, framed as a multi-objective optimization problem. For a large class of robust, efficient watermarks, the associated Pareto-optimal solutions are identified and shown to outperform the current default watermark.
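
As a toy version of such a multi-objective view, the sketch below sweeps KGW-style knobs (green fraction gamma, logit boost delta) under a uniform-distribution assumption; the per-token z-shift and KL proxies are our stand-ins for the paper's actual objectives:

```python
import math

def metrics(gamma: float, delta: float):
    """Detectability proxy (per-token z-shift) and quality cost proxy (KL),
    assuming a uniform base distribution over the vocabulary."""
    z = gamma * math.exp(delta) + (1 - gamma)
    p_green = gamma * math.exp(delta) / z
    detect = (p_green - gamma) / math.sqrt(gamma * (1 - gamma))
    kl_cost = gamma * math.exp(delta) * delta / z - math.log(z)
    return detect, kl_cost

grid = [(g / 10, d / 2) for g in range(1, 10) for d in range(1, 11)]
scored = [(metrics(g, d), (g, d)) for g, d in grid]
# Keep settings not dominated by any other (higher detect AND lower cost).
pareto = [p for p in scored
          if not any(d2 >= p[0][0] and k2 <= p[0][1] and (d2, k2) != p[0]
                     for (d2, k2), _ in scored)]
print(len(pareto), "Pareto-optimal (gamma, delta) settings")
```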

15. Cross-Attention Watermarking of Large Language Models

A new approach to linguistic watermarking of language models is presented in which information is imperceptibly inserted into the output text while preserving its readability and original meaning. A cross-attention mechanism is used to embed watermarks in the text during inference, and two methods using cross-attention are presented that minimize the effect of watermarking on the performance of the pretrained model. Exploring different training strategies for optimizing the watermark, along with the challenges and implications of applying this approach in real-world scenarios, clarifies the trade-off between watermark robustness and text quality. Watermark selection substantially affects the generated output for high-entropy sentences. This proactive watermarking approach has potential applications in future model development.

16. Can Watermarks Survive Translation? On the Cross-lingual Consistency of Text Watermark for Large Language Models

Text watermarking technology aims to tag and identify content produced by large language models (LLMs) to prevent misuse. In this study, we introduce the concept of "cross-lingual consistency" in text watermarking, which assesses the ability of text watermarks to maintain their effectiveness after being translated into other languages. Preliminary empirical results from two LLMs and three watermarking methods reveal that current text watermarking technologies lack consistency when texts are translated into various languages. Based on this observation, we propose a Cross-lingual Watermark Removal Attack (CWRA) to bypass watermarking by first obtaining a response from an LLM in a pivot language, which is then translated into the target language. CWRA can effectively remove watermarks by reducing the Area Under the Curve (AUC) from 0.95 to 0.67 without performance loss. Furthermore, we analyze two key factors that contribute to the cross-lingual consistency of text watermarking and propose a defense method that increases the AUC from 0.67 to 0.88 under CWRA.
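
The attack pipeline reduces to a few lines; `watermarked_llm` and `translate` below are trivial stand-ins used only to make the flow concrete:

```python
def translate(text: str, src: str, dst: str) -> str:
    """Stand-in for any machine-translation system."""
    return f"[{src}->{dst}] {text}"

def watermarked_llm(prompt: str) -> str:
    """Stand-in for the watermarked model; the watermark is applied here."""
    return f"reply({prompt})"

def cwra(prompt: str, pivot: str = "en", target: str = "zh") -> str:
    pivot_prompt = translate(prompt, src=target, dst=pivot)
    pivot_reply = watermarked_llm(pivot_prompt)            # tokens carry the mark
    return translate(pivot_reply, src=pivot, dst=target)   # translation disrupts it

print(cwra("写一首关于春天的诗"))
```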

17. WaterMax: breaking the LLM watermark detectability-robustness-quality trade-off

Watermarking is a technical means of deterring the malicious use of Large Language Models. This paper proposes a novel watermarking scheme, WaterMax, that enjoys high detectability while sustaining the quality of the text generated by the original LLM. Its design leaves the LLM untouched (no modification of the weights, logits, temperature, or sampling technique). WaterMax balances robustness against complexity, in contrast to watermarking techniques in the literature, which inherently incur a trade-off between quality and robustness. Its performance is both theoretically proven and experimentally validated, and it outperforms all SotA techniques under the most complete benchmark suite.
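
Our reading of this design is a draft-and-select loop: the LLM is sampled unchanged several times, and the draft with the strongest detector score is kept. The generator and keyed scorer below are stand-ins, not the paper's components:

```python
import hashlib
import random

def generate(prompt: str, seed: int) -> str:
    """Stand-in, unmodified LLM sampled with a different seed per draft."""
    rng = random.Random(seed)
    words = ["alpha", "beta", "gamma", "delta"]
    return prompt + " " + " ".join(rng.choice(words) for _ in range(12))

def detector_score(text: str, key: int = 1234) -> int:
    """Stand-in keyed detector: counts 'green' words."""
    return sum(hashlib.sha256(f"{key}:{w}".encode()).digest()[0] % 2
               for w in text.split())

def watermax_generate(prompt: str, n_drafts: int = 8) -> str:
    drafts = [generate(prompt, seed=i) for i in range(n_drafts)]
    return max(drafts, key=detector_score)   # most detectable, model untouched

print(watermax_generate("Once upon a time,"))
```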

18. Lost in Overlap: Exploring Watermark Collision in LLMs

The proliferation of large language models (LLMs) in content generation raises concerns about text copyright. Watermarking methods, particularly logit-based approaches, embed imperceptible identifiers into text to address these challenges. However, the widespread use of watermarking across diverse LLMs has led to an inevitable issue during common tasks such as question answering and paraphrasing, known as watermark collision. This study focuses on dual watermark collisions, where two watermarks are present simultaneously in the same text, and demonstrates that such collisions threaten the detection performance of the detectors of both the upstream and downstream watermark algorithms.

19. Learning to Watermark LLM-generated Text via Reinforcement Learning

We study how to watermark LLM outputs, i.e., how to embed algorithmically detectable signals into LLM-generated text to track misuse. Unlike current mainstream methods that work with a fixed LLM, we expand the watermark design space by including an LLM tuning stage in the watermark pipeline. While prior works focus on token-level watermarks that embed signals into the output, we design a model-level watermark that embeds signals into the LLM weights, where they can be detected by a paired detector. We propose a co-training framework based on reinforcement learning that iteratively (1) trains a detector to detect the generated watermarked text and (2) tunes the LLM to generate text easily detectable by the detector while maintaining normal utility. We empirically show that our watermarks are more accurate, robust, and adaptable (to new attacks), and that they allow the watermarked model to be open-sourced. In addition, when used together with alignment, the extra overhead is low: only an extra reward model (i.e., our detector) needs to be trained. We hope our work encourages study of a broader watermark design space that is not limited to a fixed LLM. Our code is open-sourced at this https URL.

20. Provably Robust Multi-bit Watermarking for AI-generated Text via Error Correction Code

Large Language Models (LLMs) have been widely deployed for their remarkable capability to generate text resembling human language. However, they can be misused by criminals to create deceptive content, such as fake news and phishing emails, raising ethical concerns. Watermarking is a key technique for mitigating LLM misuse: it embeds a watermark (e.g., a bit string) into text generated by an LLM, enabling both the detection of LLM-generated text and the tracing of generated text to a specific user. The major limitation of existing watermark techniques is that they cannot accurately or efficiently extract the watermark from a text, especially when the watermark is a long bit string. This key limitation impedes their deployment in real-world applications, e.g., tracing generated texts to specific users.
This work introduces a novel watermarking method for LLM-generated text grounded in error-correction codes to address this challenge. We provide strong theoretical analysis demonstrating that under bounded adversarial word/token edits (insertions, deletions, and substitutions), our method can correctly extract watermarks, offering a provable robustness guarantee. This breakthrough is also evidenced by our extensive experimental results: our method substantially outperforms existing baselines in both accuracy and robustness on benchmark datasets. For instance, when embedding a bit string of length 12 into a 200-token generated text, our approach attains a match rate of 98.4%, surpassing the 85.6% of Yoo et al. (the state-of-the-art baseline). When subjected to a copy-paste attack injecting 50 tokens into a generated text of 200 words, our method maintains a match rate of 90.8%, while that of Yoo et al. drops below 65%.
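
The error-correction layer can be illustrated with the simplest possible code, a 3x repetition code; the paper's construction is more sophisticated, but the robustness mechanism it buys is the same:

```python
def ecc_encode(bits, r: int = 3):
    """Repeat each message bit r times."""
    return [b for b in bits for _ in range(r)]

def ecc_decode(noisy, r: int = 3):
    """Majority-vote each r-bit block back into one message bit."""
    return [int(sum(noisy[i:i + r]) * 2 > r) for i in range(0, len(noisy), r)]

message = [1, 0, 1, 1]            # e.g., part of a user ID
channel = ecc_encode(message)     # bits actually embedded into the text
channel[2] ^= 1                   # an adversarial edit flips one bit
assert ecc_decode(channel) == message   # message still recovered exactly
```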

21. Three Bricks to Consolidate Watermarks for Large Language Models

The task of discerning between generated and natural texts is increasingly challenging. In this context, watermarking emerges as a promising technique for ascribing generated text to a specific model. It alters the sampling generation process so as to leave an invisible trace in the generated output, facilitating later detection. This research consolidates watermarks for large language models based on three theoretical and empirical considerations. First, we introduce new statistical tests that offer robust theoretical guarantees which remain valid even at low false-positive rates (below 10⁻⁶). Second, we compare the effectiveness of watermarks using classical benchmarks in the field of natural language processing, gaining insights into their real-world applicability. Third, we develop advanced detection schemes for scenarios where access to the LLM is available, as well as multi-bit watermarking.
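
As an illustration of the first brick, here is a green-list test with an exact binomial p-value; it is not the paper's exact statistic, which handles subtleties such as repeated n-grams:

```python
from scipy.stats import binom

def green_pvalue(num_green: int, num_scored: int, gamma: float) -> float:
    """Exact p-value under H0 (unwatermarked): green hits ~ Binomial(n, gamma)."""
    return binom.sf(num_green - 1, num_scored, gamma)

p = green_pvalue(160, 200, gamma=0.5)
print(p, p < 1e-6)   # an exact tail bound stays meaningful at tiny FPRs
```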

22. Watermarking Generative Tabular Data

In this paper, we introduce a simple yet effective watermarking mechanism for tabular data with statistical guarantees. We show theoretically that the proposed watermark can be effectively detected while faithfully preserving data fidelity, and that it exhibits appealing robustness against additive noise attacks. The general idea is to achieve watermarking through a strategic embedding based on simple data binning: each feature's value range is divided into finely segmented intervals, and watermarks are embedded into selected "green list" intervals. To detect the watermarks, we develop a principled statistical hypothesis-testing framework with minimal assumptions: it remains valid as long as the underlying data distribution has a continuous density function. The watermarking efficacy is demonstrated through rigorous theoretical analysis and empirical validation, highlighting its utility in enhancing the security of synthetic and real-world datasets.
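
A minimal sketch of the interval ("bin") green-list idea for one numeric column, with an illustrative bin width and a keyed 50/50 green assignment (our stand-ins): values are nudged into green bins at embedding time, and detection counts green hits, which stay near 0.5 for unwatermarked data:

```python
import math
import random

def is_green(bin_idx: int, key: int = 42) -> bool:
    """Keyed pseudo-random green/red assignment per interval."""
    return random.Random(f"{key}:{bin_idx}").random() < 0.5

def embed_column(values, width: float = 0.01):
    """Nudge each value into a nearby green bin's midpoint."""
    out = []
    for v in values:
        b = math.floor(v / width)
        while not is_green(b):
            b += 1                      # ~1 expected step at rate 0.5
        out.append((b + 0.5) * width)
    return out

def green_rate(values, width: float = 0.01) -> float:
    """Detection statistic: fraction of values landing in green bins."""
    return sum(is_green(math.floor(v / width)) for v in values) / len(values)

data = [random.random() for _ in range(1000)]
print(green_rate(data), green_rate(embed_column(data)))   # ~0.5 vs 1.0
```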

23. A Watermark for Low-entropy and Unbiased Generation in Large Language Models

Recent advancements in large language models (LLMs) have highlighted the risk of misuse, raising concerns about accurately detecting LLM-generated content. A viable solution to the detection problem is to inject imperceptible identifiers, known as watermarks, into LLMs. Previous work demonstrates that unbiased watermarks ensure unforgeability and preserve text quality by maintaining the expectation of the LLM output probability distribution. However, previous unbiased watermarking methods are impractical for local deployment because they rely on access to a white-box LLM and the input prompt during detection. Moreover, these methods fail to provide statistical guarantees for the type II error of watermark detection. This study proposes Sampling One Then Accepting (STA-1), an unbiased watermark that requires neither access to the LLM nor the prompt during detection and has statistical guarantees for the type II error. We further describe a novel trade-off between watermark strength and text quality in unbiased watermarks: in low-entropy scenarios, unbiased watermarks face a trade-off between watermark strength and the risk of unsatisfactory outputs. Experimental results on low-entropy and high-entropy datasets demonstrate that STA-1 achieves text quality and watermark strength comparable to existing unbiased watermarks, with a low risk of unsatisfactory outputs. Implementation code for this study is available online.
