Learning to Inject: Automated Prompt Injection via Reinforcement Learning
AI Summary
Proposes AutoInject, a framework that uses reinforcement learning to automatically generate prompt injection attacks, strengthening security evaluation of LLM agents.
Key Contributions
- Proposes AutoInject, an automated prompt injection method based on reinforcement learning
- Attacks a range of LLMs under black-box conditions, including GPT and Claude models
- Establishes a stronger prompt injection baseline on the AgentDojo benchmark
Methodology
Within a reinforcement learning framework, a generator is trained to produce adversarial suffixes, jointly optimizing for attack success rate and utility preservation on benign tasks.
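The joint objective and the query-based black-box setting described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the reward weighting `alpha`, the hill-climbing search, and all function names are assumptions introduced here for clarity.

```python
import random

def joint_reward(attack_success: bool, benign_utility: float,
                 alpha: float = 0.5) -> float:
    """Hypothetical joint objective: reward attack success while
    preserving the agent's utility on its benign task.

    attack_success: did the injected instruction get executed?
    benign_utility: score in [0, 1] on the agent's original task.
    alpha: trade-off weight between the two terms (assumed).
    """
    return alpha * float(attack_success) + (1.0 - alpha) * benign_utility

def query_based_search(score_fn, vocab, suffix_len=5, iters=200, seed=0):
    """Toy black-box optimization of a token suffix by hill climbing:
    mutate one token at a time and keep the change if the queried
    score improves. Stands in for the paper's RL-trained generator."""
    rng = random.Random(seed)
    suffix = [rng.choice(vocab) for _ in range(suffix_len)]
    best = score_fn(suffix)
    for _ in range(iters):
        candidate = suffix[:]
        candidate[rng.randrange(suffix_len)] = rng.choice(vocab)
        score = score_fn(candidate)
        if score > best:
            suffix, best = candidate, score
    return suffix, best
```

In the real system the score would come from querying the target agent (or transferring a suffix optimized elsewhere), and the generator is a 1.5B-parameter model trained with RL rather than a random search.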
Original Abstract
Prompt injection is one of the most critical vulnerabilities in LLM agents; yet, effective automated attacks remain largely unexplored from an optimization perspective. Existing methods heavily depend on human red-teamers and hand-crafted prompts, limiting their scalability and adaptability. We propose AutoInject, a reinforcement learning framework that generates universal, transferable adversarial suffixes while jointly optimizing for attack success and utility preservation on benign tasks. Our black-box method supports both query-based optimization and transfer attacks to unseen models and tasks. Using only a 1.5B parameter adversarial suffix generator, we successfully compromise frontier systems including GPT 5 Nano, Claude Sonnet 3.5, and Gemini 2.5 Flash on the AgentDojo benchmark, establishing a stronger baseline for automated prompt injection research.