AI Agents relevance: 9/10

Learning to Inject: Automated Prompt Injection via Reinforcement Learning

Xin Chen, Jie Zhang, Florian Tramer
arXiv: 2602.05746v1 Published: 2026-02-05 Updated: 2026-02-05

AI Summary

Proposes AutoInject, a framework that uses reinforcement learning to automatically generate prompt injection attacks, strengthening the security evaluation of LLMs.

Key Contributions

  • Proposes AutoInject, an automated prompt injection method based on reinforcement learning
  • Attacks a variety of LLMs under black-box conditions, including GPT and Claude models
  • Establishes a stronger prompt injection baseline on the AgentDojo benchmark

Methodology

A reinforcement learning framework trains a generator that produces adversarial suffixes, jointly optimizing for attack success rate and utility preservation on benign tasks.
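The joint objective above can be sketched as a scalar reward. This is a minimal illustration, not the paper's actual formulation: the function name, the binary attack-success signal, and the `utility_weight` coefficient are all assumptions.

```python
# Hypothetical sketch of a joint reward for suffix generation: the generator
# is rewarded when the injected instruction is executed, while a weighted
# utility term discourages suffixes that break the agent's benign task.
# All names and the trade-off coefficient are assumptions, not from the paper.

def joint_reward(attack_success: bool, benign_utility: float,
                 utility_weight: float = 0.5) -> float:
    """Combine attack success with benign-task utility preservation.

    attack_success: did the agent execute the injected instruction?
    benign_utility: score in [0, 1] for the agent's original (benign) task.
    utility_weight: assumed trade-off coefficient between the two terms.
    """
    return float(attack_success) + utility_weight * benign_utility
```

In an RL setup, a reward of this shape would be fed back to the suffix generator (e.g. via a policy-gradient update), so that suffixes which both hijack the agent and preserve benign behavior score highest.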

Original Abstract

Prompt injection is one of the most critical vulnerabilities in LLM agents; yet, effective automated attacks remain largely unexplored from an optimization perspective. Existing methods heavily depend on human red-teamers and hand-crafted prompts, limiting their scalability and adaptability. We propose AutoInject, a reinforcement learning framework that generates universal, transferable adversarial suffixes while jointly optimizing for attack success and utility preservation on benign tasks. Our black-box method supports both query-based optimization and transfer attacks to unseen models and tasks. Using only a 1.5B parameter adversarial suffix generator, we successfully compromise frontier systems including GPT 5 Nano, Claude Sonnet 3.5, and Gemini 2.5 Flash on the AgentDojo benchmark, establishing a stronger baseline for automated prompt injection research.
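The abstract's "query-based optimization" can be pictured as a black-box search loop: sample candidate suffixes, query the target agent, and keep the best-scoring one. This is a hedged sketch of that generic pattern, not AutoInject itself; `generate`, `evaluate`, and the query budget are all hypothetical.

```python
import random

# Hypothetical sketch of query-based black-box suffix search (all names are
# assumptions): sample candidate suffixes from a generator, query the target
# agent for a scalar reward, and track the best suffix found so far.

def query_based_search(generate, evaluate, n_queries=100, seed=0):
    """Black-box search for a high-reward adversarial suffix.

    generate: callable(rng) -> candidate suffix (e.g. a trained generator).
    evaluate: callable(suffix) -> scalar reward from querying the target.
    """
    rng = random.Random(seed)
    best_suffix, best_reward = None, float("-inf")
    for _ in range(n_queries):
        suffix = generate(rng)
        reward = evaluate(suffix)  # one black-box query to the target agent
        if reward > best_reward:
            best_suffix, best_reward = suffix, reward
    return best_suffix, best_reward
```

A transfer attack, by contrast, would skip the query loop entirely and reuse a suffix optimized against one model directly on an unseen model or task.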

Tags

Prompt Injection · Reinforcement Learning · LLM Security · Adversarial Attacks

arXiv Categories

cs.LG cs.AI