Towards Poisoning Robustness Certification for Natural Language Generation
AI Summary
Proposes a certified poisoning-defense framework for natural language generation tasks, ensuring the reliability of language models in security-sensitive domains.
Key Contributions
- Formally defines two security properties for natural language generation: stability and validity
- Introduces Targeted Partition Aggregation (TPA), an algorithm for certifying validity against targeted attacks
- Tightens guarantees for multi-turn generation using mixed-integer linear programming (MILP)
Methodology
Certifies a language model's robustness to targeted attacks by computing the minimum poisoning budget required to induce a specific harmful class, token, or phrase.
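The minimum-budget idea can be sketched for the simplest case, a majority vote over disjoint training partitions. This is an illustrative stand-in, not the paper's actual TPA algorithm: the function name, vote format, and simplified tie-breaking are all assumptions.

```python
import math
from collections import Counter

def certified_targeted_budget(votes, target):
    """Hypothetical TPA-style bound: minimum number of poisoned samples
    needed to force `target` to win a partition-aggregation majority vote.

    votes  : dict mapping class -> number of partitions voting for it
    target : the harmful class the adversary wants to induce

    Assumes each poisoned sample corrupts at most one partition, and that
    a corrupted partition can simultaneously add one vote to `target` and
    remove one vote from the current winner (tie-breaking is ignored).
    """
    counts = Counter(votes)
    winner, top = counts.most_common(1)[0]
    if winner == target:
        return 0  # target already wins; nothing to certify against
    gap = top - counts.get(target, 0)
    # each corrupted partition closes the winner/target gap by at most 2
    return math.ceil(gap / 2)

# e.g. 50 partitions: 30 vote benign, 15 other, 5 harmful
# -> an adversary needs at least 13 poisoned samples, so any
#    poisoning of up to 12 samples is certified not to induce "harmful"
budget = certified_targeted_budget(
    {"benign": 30, "other": 15, "harmful": 5}, "harmful"
)  # -> 13
```

The key difference from an untargeted certificate is that the gap is measured against the specific harmful target rather than the runner-up, which typically yields a much larger certified budget.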
Original Abstract
Understanding the reliability of natural language generation is critical for deploying foundation models in security-sensitive domains. While certified poisoning defenses provide provable robustness bounds for classification tasks, they are fundamentally ill-equipped for autoregressive generation: they cannot handle sequential predictions or the exponentially large output space of language models. To establish a framework for certified natural language generation, we formalize two security properties: stability (robustness to any change in generation) and validity (robustness to targeted, harmful changes in generation). We introduce Targeted Partition Aggregation (TPA), the first algorithm to certify validity/targeted attacks by computing the minimum poisoning budget needed to induce a specific harmful class, token, or phrase. Further, we extend TPA to provide tighter guarantees for multi-turn generations using mixed integer linear programming (MILP). Empirically, we demonstrate TPA's effectiveness across diverse settings including: certifying validity of agent tool-calling when adversaries modify up to 0.5% of the dataset and certifying 8-token stability horizons in preference-based alignment. Though inference-time latency remains an open challenge, our contributions enable certified deployment of language models in security-critical applications.
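The multi-turn guarantee mentioned in the abstract can be contrasted with a naive per-turn baseline. The sketch below is that loose baseline only, under the same simplified voting assumptions as above; it does not reproduce the paper's MILP formulation, and its function name and vote format are hypothetical.

```python
def stability_budget_naive(per_turn_votes):
    """Loose multi-turn stability certificate (illustrative, not the
    paper's MILP): a T-turn generation is stable as long as no single
    turn's majority vote can be flipped, so this returns the minimum
    over per-turn flip budgets. The MILP formulation tightens this by
    modelling that one corrupted partition changes its vote in every
    turn jointly rather than independently.

    per_turn_votes : list of dicts, one per turn, class -> vote count
    """
    budgets = []
    for votes in per_turn_votes:
        ordered = sorted(votes.values(), reverse=True)
        runner_up = ordered[1] if len(ordered) > 1 else 0
        gap = ordered[0] - runner_up
        # smallest r with 2*r > gap flips this turn (tie-breaking ignored)
        budgets.append(gap // 2 + 1)
    return min(budgets)

# two turns: the second turn's narrow 7-vs-6 vote dominates the bound
# -> certificate of only 1, motivating the tighter joint MILP analysis
bound = stability_budget_naive([{"a": 10, "b": 2}, {"a": 7, "b": 6}])  # -> 1
```

Because the naive bound is governed by the single weakest turn, longer horizons (such as the 8-token stability horizons reported in the abstract) benefit most from the joint analysis.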