Recursive language models for jailbreak detection: a procedural defense for tool-augmented agents
AI Summary
RLM-JB is an end-to-end jailbreak detection framework built on recursive language models that provides an effective defense against attacks on tool-augmented agents.
Key Contributions
- Proposes RLM-JB, a framework for detecting jailbreak attacks against LLMs
- Uses recursive language models to analyze and process inputs
- Validates the framework's effectiveness across multiple LLM backends
Methodology
Builds a recursive language model that treats detection as a procedure: the input is transformed, chunked, screened in parallel, and the per-chunk signals are combined to detect attacks.
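The procedure above (normalize, chunk, screen in parallel, aggregate) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `screen` worker is a keyword-matching placeholder standing in for a real worker-model query, and the chunk sizes, threshold, and max-score aggregation are assumed values chosen for the example.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Zero-width characters are a common lightweight obfuscation.
ZERO_WIDTH = re.compile(r"[\u200b\u200c\u200d\ufeff]")

def normalize(text: str) -> str:
    """De-obfuscate: strip zero-width characters, collapse whitespace (illustrative)."""
    return re.sub(r"\s+", " ", ZERO_WIDTH.sub("", text)).strip()

def chunk(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Overlapping chunks so every span is covered by at least one worker query."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def screen(chunk_text: str) -> float:
    """Placeholder worker: returns a jailbreak likelihood in [0, 1].
    A real system would query an LLM here; this stub keyword-matches."""
    suspicious = ("ignore previous instructions", "developer mode", "no restrictions")
    return 1.0 if any(s in chunk_text.lower() for s in suspicious) else 0.0

def detect(text: str, threshold: float = 0.5) -> bool:
    """Root procedure: normalize -> chunk -> parallel screen -> aggregate evidence.
    Aggregation here is a simple max; the paper composes richer cross-chunk signals."""
    chunks = chunk(normalize(text))
    with ThreadPoolExecutor() as pool:
        scores = list(pool.map(screen, chunks))
    return max(scores, default=0.0) >= threshold
```

Note how normalization defeats the zero-width-character split: `"ign\u200bore previous instructions"` is rejoined before screening, so the keyword check still fires.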
原文摘要
Jailbreak prompts are a practical and evolving threat to large language models (LLMs), particularly in agentic systems that execute tools over untrusted content. Many attacks exploit long-context hiding, semantic camouflage, and lightweight obfuscations that can evade single-pass guardrails. We present RLM-JB, an end-to-end jailbreak detection framework built on Recursive Language Models (RLMs), in which a root model orchestrates a bounded analysis program that transforms the input, queries worker models over covered segments, and aggregates evidence into an auditable decision. RLM-JB treats detection as a procedure rather than a one-shot classification: it normalizes and de-obfuscates suspicious inputs, chunks text to reduce context dilution and guarantee coverage, performs parallel chunk screening, and composes cross-chunk signals to recover split-payload attacks. On AutoDAN-style adversarial inputs, RLM-JB achieves high detection effectiveness across three LLM backends (ASR/Recall 92.5-98.0%) while maintaining very high precision (98.99-100%) and low false positive rates (0.0-2.0%), highlighting a practical sensitivity-specificity trade-off as the screening backend changes.