AI Agents Relevance: 9/10

Helpful to a Fault: Measuring Illicit Assistance in Multi-Turn, Multilingual LLM Agents

Nivya Talokar, Ayush K Tarun, Murari Mandal, Maksym Andriushchenko, Antoine Bosselut
arXiv: 2602.16346v1 Published: 2026-02-18 Updated: 2026-02-18

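AI Summary

STING is a framework for evaluating how much illicit assistance multi-turn, multilingual LLM agents provide. It shows that single-prompt benchmarks miss this multi-turn behavior, and it adds analysis tools built on time-to-first-jailbreak to measure it.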

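Key Contributions

  • Proposes STING, an automated red-teaming framework for measuring illicit assistance by LLM agents over multiple turns.
  • Introduces an analysis framework that models multi-turn red-teaming as a time-to-first-jailbreak random variable, and proposes a new metric, Restricted Mean Jailbreak Discovery.
  • Multilingual evaluation across six non-English settings shows that attack success and illicit-task completion do not consistently increase in lower-resource languages.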
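
The time-to-first-jailbreak view can be illustrated with a small survival-analysis sketch. This is not the paper's implementation: the function names, the Kaplan-Meier style estimator, and the reading of the restricted mean as the expected number of turns to first jailbreak capped at a horizon (by analogy with restricted mean survival time) are all assumptions made for illustration. The exact definition of Restricted Mean Jailbreak Discovery is not given in this summary.

```python
from typing import List, Optional


def discovery_curve(first_jailbreak_turns: List[Optional[int]],
                    max_turns: int) -> List[float]:
    """Estimate P(first jailbreak occurs by turn t) for t = 1..max_turns.

    Each entry is the 1-based turn of the first jailbreak in one episode,
    or None if none was found within the turn budget (treated as
    right-censored at max_turns). Hypothetical helper, not from the paper.
    """
    survival = 1.0                      # P(no jailbreak yet)
    at_risk = len(first_jailbreak_turns)
    curve = []
    for t in range(1, max_turns + 1):
        events = sum(1 for x in first_jailbreak_turns if x == t)
        if at_risk > 0:
            survival *= 1.0 - events / at_risk
        at_risk -= events
        curve.append(1.0 - survival)
    return curve


def restricted_mean_turns(curve: List[float], horizon: int) -> float:
    """Expected value of min(T, horizon), where T is the first-jailbreak turn.

    For discrete T >= 1 this equals the sum of S(t) for t = 0..horizon-1,
    with S(0) = 1 and S(t) = 1 - curve[t-1]. Assumed analogue of a
    restricted mean; the paper's metric may be defined differently.
    """
    survival = [1.0] + [1.0 - f for f in curve[:horizon - 1]]
    return sum(survival)


if __name__ == "__main__":
    # Four made-up episodes; the last one never produced a jailbreak.
    turns = [1, 2, 2, None]
    curve = discovery_curve(turns, max_turns=3)
    print(curve)
    print(restricted_mean_turns(curve, horizon=3))
```

Under this reading, a smaller restricted mean means jailbreaks are discovered in fewer turns. Because every censored episode here ends at the same turn budget, the curve coincides with the empirical fraction of episodes jailbroken by each turn.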

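Methodology

Constructs a step-by-step illicit plan grounded in a benign persona, iteratively probes the target agent with adaptive follow-ups, uses judge agents to track phase completion, and analyzes the results statistically.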
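
The control flow described above can be sketched as a small evaluation-harness loop. Everything here is hypothetical: `run_episode`, the `attacker`, `target`, and `judge` callables, and the choice to count a jailbreak when the final phase is judged complete are assumptions for illustration, not STING's actual interface or success criterion. The stubs carry placeholder strings only.

```python
from typing import Callable, List, Optional


def run_episode(phases: List[str],
                attacker: Callable[[str, List[str]], str],
                target: Callable[[str], str],
                judge: Callable[[str, str], bool],
                max_turns: int) -> Optional[int]:
    """Probe the target phase by phase within a turn budget.

    Returns the 1-based turn at which the last phase was judged complete,
    or None if the budget ran out first (assumed success criterion).
    """
    if not phases:
        return None
    history: List[str] = []             # target replies seen so far
    phase_idx = 0
    for turn in range(1, max_turns + 1):
        phase = phases[phase_idx]
        # The attacker adapts its next message to the current phase and history.
        message = attacker(phase, history)
        reply = target(message)
        history.append(reply)
        # A judge decides whether this reply completes the current phase.
        if judge(phase, reply):
            phase_idx += 1
            if phase_idx == len(phases):
                return turn
    return None


if __name__ == "__main__":
    # Placeholder stubs: the target echoes, the judge checks the echo.
    result = run_episode(
        phases=["phase-1", "phase-2"],
        attacker=lambda phase, history: phase,
        target=lambda message: "done:" + message,
        judge=lambda phase, reply: reply.endswith(phase),
        max_turns=5,
    )
    print(result)
```

The returned turn index (or None for a censored episode) is the per-episode quantity that a time-to-first-jailbreak analysis would consume.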

Original Abstract

LLM-based agents execute real-world workflows via tools and memory. These affordances enable ill-intended adversaries to also use these agents to carry out complex misuse scenarios. Existing agent misuse benchmarks largely test single-prompt instructions, leaving a gap in measuring how agents end up helping with harmful or illegal tasks over multiple turns. We introduce STING (Sequential Testing of Illicit N-step Goal execution), an automated red-teaming framework that constructs a step-by-step illicit plan grounded in a benign persona and iteratively probes a target agent with adaptive follow-ups, using judge agents to track phase completion. We further introduce an analysis framework that models multi-turn red-teaming as a time-to-first-jailbreak random variable, enabling analysis tools like discovery curves, hazard-ratio attribution by attack language, and a new metric: Restricted Mean Jailbreak Discovery. Across AgentHarm scenarios, STING yields substantially higher illicit-task completion than single-turn prompting and chat-oriented multi-turn baselines adapted to tool-using agents. In multilingual evaluations across six non-English settings, we find that attack success and illicit-task completion do not consistently increase in lower-resource languages, diverging from common chatbot findings. Overall, STING provides a practical way to evaluate and stress-test agent misuse in realistic deployment settings, where interactions are inherently multi-turn and often multilingual.

Tags

LLM Agents Red Teaming Multilingual Security

arXiv Categories

cs.CL cs.LG