Recursive language models for jailbreak detection: a procedural defense for tool-augmented agents
AI Summary
RLM-JB is an end-to-end jailbreak detection framework built on recursive language models that provides an effective defense against attacks on tool-augmented agents.
Key Contributions
- Proposes RLM-JB, a framework for detecting jailbreak attacks against LLMs
- Uses recursive language models to analyze and process inputs
- Validates the framework's effectiveness across multiple LLM backends
Methodology
Builds a recursive language model that treats detection as a procedure: the input is transformed, chunked, screened in parallel, and the per-chunk signals are combined to detect attacks.
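The procedure above (normalize, chunk, screen in parallel, aggregate) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `screen` worker is a keyword-matching placeholder standing in for a real worker-model query, and the chunk sizes, threshold, and max-score aggregation are assumed values chosen for the example.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Zero-width characters are a common lightweight obfuscation.
ZERO_WIDTH = re.compile(r"[\u200b\u200c\u200d\ufeff]")

def normalize(text: str) -> str:
    """De-obfuscate: strip zero-width characters, collapse whitespace (illustrative)."""
    return re.sub(r"\s+", " ", ZERO_WIDTH.sub("", text)).strip()

def chunk(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Overlapping chunks so every span is covered by at least one worker query."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def screen(chunk_text: str) -> float:
    """Placeholder worker: returns a jailbreak likelihood in [0, 1].
    A real system would query an LLM here; this stub keyword-matches."""
    suspicious = ("ignore previous instructions", "developer mode", "no restrictions")
    return 1.0 if any(s in chunk_text.lower() for s in suspicious) else 0.0

def detect(text: str, threshold: float = 0.5) -> bool:
    """Root procedure: normalize -> chunk -> parallel screen -> aggregate evidence.
    Aggregation here is a simple max; the paper composes richer cross-chunk signals."""
    chunks = chunk(normalize(text))
    with ThreadPoolExecutor() as pool:
        scores = list(pool.map(screen, chunks))
    return max(scores, default=0.0) >= threshold
```

Note how normalization defeats the zero-width-character split: `"ign\u200bore previous instructions"` is rejoined before screening, so the keyword check still fires.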
原文摘要
Jailbreak prompts are a practical and evolving threat to large language models (LLMs), particularly in agentic systems that execute tools over untrusted content. Many attacks exploit long-context hiding, semantic camouflage, and lightweight obfuscations that can evade single-pass guardrails. We present RLM-JB, an end-to-end jailbreak detection framework built on Recursive Language Models (RLMs), in which a root model orchestrates a bounded analysis program that transforms the input, queries worker models over covered segments, and aggregates evidence into an auditable decision. RLM-JB treats detection as a procedure rather than a one-shot classification: it normalizes and de-obfuscates suspicious inputs, chunks text to reduce context dilution and guarantee coverage, performs parallel chunk screening, and composes cross-chunk signals to recover split-payload attacks. On AutoDAN-style adversarial inputs, RLM-JB achieves high detection effectiveness across three LLM backends (ASR/Recall 92.5-98.0%) while maintaining very high precision (98.99-100%) and low false positive rates (0.0-2.0%), highlighting a practical sensitivity-specificity trade-off as the screening backend changes.