Learning to Inject: Automated Prompt Injection via Reinforcement Learning
AI Summary
Proposes AutoInject, a framework that uses reinforcement learning to automatically generate prompt injection attacks, strengthening security evaluation of LLM agents.
Key Contributions
- Proposes AutoInject, an automated prompt injection method based on reinforcement learning
- Attacks a range of LLMs under black-box conditions, including GPT and Claude models
- Establishes a stronger prompt injection baseline on the AgentDojo benchmark
Methodology
Within a reinforcement learning framework, a generator is trained to produce adversarial suffixes, jointly optimizing for attack success rate and utility preservation on benign tasks.
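The joint objective and the query-based black-box setting described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the reward weighting `alpha`, the hill-climbing search, and all function names are assumptions introduced here for clarity.

```python
import random

def joint_reward(attack_success: bool, benign_utility: float,
                 alpha: float = 0.5) -> float:
    """Hypothetical joint objective: reward attack success while
    preserving the agent's utility on its benign task.

    attack_success: did the injected instruction get executed?
    benign_utility: score in [0, 1] on the agent's original task.
    alpha: trade-off weight between the two terms (assumed).
    """
    return alpha * float(attack_success) + (1.0 - alpha) * benign_utility

def query_based_search(score_fn, vocab, suffix_len=5, iters=200, seed=0):
    """Toy black-box optimization of a token suffix by hill climbing:
    mutate one token at a time and keep the change if the queried
    score improves. Stands in for the paper's RL-trained generator."""
    rng = random.Random(seed)
    suffix = [rng.choice(vocab) for _ in range(suffix_len)]
    best = score_fn(suffix)
    for _ in range(iters):
        candidate = suffix[:]
        candidate[rng.randrange(suffix_len)] = rng.choice(vocab)
        score = score_fn(candidate)
        if score > best:
            suffix, best = candidate, score
    return suffix, best
```

In the real system the score would come from querying the target agent (or transferring a suffix optimized elsewhere), and the generator is a 1.5B-parameter model trained with RL rather than a random search.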
Original Abstract
Prompt injection is one of the most critical vulnerabilities in LLM agents; yet, effective automated attacks remain largely unexplored from an optimization perspective. Existing methods heavily depend on human red-teamers and hand-crafted prompts, limiting their scalability and adaptability. We propose AutoInject, a reinforcement learning framework that generates universal, transferable adversarial suffixes while jointly optimizing for attack success and utility preservation on benign tasks. Our black-box method supports both query-based optimization and transfer attacks to unseen models and tasks. Using only a 1.5B parameter adversarial suffix generator, we successfully compromise frontier systems including GPT 5 Nano, Claude Sonnet 3.5, and Gemini 2.5 Flash on the AgentDojo benchmark, establishing a stronger baseline for automated prompt injection research.