Towards Poisoning Robustness Certification for Natural Language Generation
AI Summary
Proposes a certified poisoning-defense framework for natural language generation tasks, ensuring the reliability of language models in security-sensitive domains.
Key Contributions
- Formally defines two security properties for natural language generation: stability and validity
- Introduces Targeted Partition Aggregation (TPA), an algorithm for certifying validity against targeted attacks
- Tightens guarantees for multi-turn generation using mixed-integer linear programming (MILP)
Methodology
Certifies a language model's robustness to targeted attacks by computing the minimum poisoning budget required to induce a specific harmful class, token, or phrase.
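The minimum-budget idea can be sketched for the simplest case, a majority vote over disjoint training partitions. This is an illustrative stand-in, not the paper's actual TPA algorithm: the function name, vote format, and simplified tie-breaking are all assumptions.

```python
import math
from collections import Counter

def certified_targeted_budget(votes, target):
    """Hypothetical TPA-style bound: minimum number of poisoned samples
    needed to force `target` to win a partition-aggregation majority vote.

    votes  : dict mapping class -> number of partitions voting for it
    target : the harmful class the adversary wants to induce

    Assumes each poisoned sample corrupts at most one partition, and that
    a corrupted partition can simultaneously add one vote to `target` and
    remove one vote from the current winner (tie-breaking is ignored).
    """
    counts = Counter(votes)
    winner, top = counts.most_common(1)[0]
    if winner == target:
        return 0  # target already wins; nothing to certify against
    gap = top - counts.get(target, 0)
    # each corrupted partition closes the winner/target gap by at most 2
    return math.ceil(gap / 2)

# e.g. 50 partitions: 30 vote benign, 15 other, 5 harmful
# -> an adversary needs at least 13 poisoned samples, so any
#    poisoning of up to 12 samples is certified not to induce "harmful"
budget = certified_targeted_budget(
    {"benign": 30, "other": 15, "harmful": 5}, "harmful"
)  # -> 13
```

The key difference from an untargeted certificate is that the gap is measured against the specific harmful target rather than the runner-up, which typically yields a much larger certified budget.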
Original Abstract
Understanding the reliability of natural language generation is critical for deploying foundation models in security-sensitive domains. While certified poisoning defenses provide provable robustness bounds for classification tasks, they are fundamentally ill-equipped for autoregressive generation: they cannot handle sequential predictions or the exponentially large output space of language models. To establish a framework for certified natural language generation, we formalize two security properties: stability (robustness to any change in generation) and validity (robustness to targeted, harmful changes in generation). We introduce Targeted Partition Aggregation (TPA), the first algorithm to certify validity/targeted attacks by computing the minimum poisoning budget needed to induce a specific harmful class, token, or phrase. Further, we extend TPA to provide tighter guarantees for multi-turn generations using mixed integer linear programming (MILP). Empirically, we demonstrate TPA's effectiveness across diverse settings including: certifying validity of agent tool-calling when adversaries modify up to 0.5% of the dataset and certifying 8-token stability horizons in preference-based alignment. Though inference-time latency remains an open challenge, our contributions enable certified deployment of language models in security-critical applications.
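The multi-turn guarantee mentioned in the abstract can be contrasted with a naive per-turn baseline. The sketch below is that loose baseline only, under the same simplified voting assumptions as above; it does not reproduce the paper's MILP formulation, and its function name and vote format are hypothetical.

```python
def stability_budget_naive(per_turn_votes):
    """Loose multi-turn stability certificate (illustrative, not the
    paper's MILP): a T-turn generation is stable as long as no single
    turn's majority vote can be flipped, so this returns the minimum
    over per-turn flip budgets. The MILP formulation tightens this by
    modelling that one corrupted partition changes its vote in every
    turn jointly rather than independently.

    per_turn_votes : list of dicts, one per turn, class -> vote count
    """
    budgets = []
    for votes in per_turn_votes:
        ordered = sorted(votes.values(), reverse=True)
        runner_up = ordered[1] if len(ordered) > 1 else 0
        gap = ordered[0] - runner_up
        # smallest r with 2*r > gap flips this turn (tie-breaking ignored)
        budgets.append(gap // 2 + 1)
    return min(budgets)

# two turns: the second turn's narrow 7-vs-6 vote dominates the bound
# -> certificate of only 1, motivating the tighter joint MILP analysis
bound = stability_budget_naive([{"a": 10, "b": 2}, {"a": 7, "b": 6}])  # -> 1
```

Because the naive bound is governed by the single weakest turn, longer horizons (such as the 8-token stability horizons reported in the abstract) benefit most from the joint analysis.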