Agent Tuning & Optimization Relevance: 9/10

Learning When to Act or Refuse: Guarding Agentic Reasoning Models for Safe Multi-Step Tool Use

Aradhye Agarwal, Gurdit Siyan, Yash Pandya, Joykirat Singh, Akshay Nambi, Ahmed Awadallah
arXiv: 2603.03205v1 Published: 2026-03-03 Updated: 2026-03-03

AI Summary

The MOSAIC framework makes safety decisions explicit, improving agent safety in multi-step tool use and reducing harmful behavior.

Key Contributions

  • Proposes the MOSAIC framework, which makes safety decisions explicit
  • Trains with preference-based reinforcement learning, requiring no trajectory-level labels
  • Validates MOSAIC's effectiveness across multiple model families and benchmarks

Methodology

MOSAIC is a post-training framework in which the agent reasons through a plan-check-act/refuse loop and is trained with preference-based reinforcement learning.
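The plan-check-act/refuse loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: all names (`Step`, `run_episode`, the `plan`/`safety_check`/`execute` callables) are hypothetical assumptions about how such a loop might be organized.

```python
# Hypothetical sketch of a plan -> check -> act/refuse loop, with refusal
# treated as a first-class action that terminates the episode.
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Step:
    thought: str     # the agent's reasoning for the next action
    tool: str        # name of the tool to invoke
    args: Dict       # arguments for the tool call


def run_episode(plan: Callable[[List[str]], Step],
                safety_check: Callable[[Step], bool],
                execute: Callable[[Step], str],
                max_steps: int = 8) -> List[str]:
    """Run one multi-step tool-use episode under the plan-check-act/refuse loop."""
    history: List[str] = []
    for _ in range(max_steps):
        step = plan(history)                 # plan: propose the next tool call
        if not safety_check(step):           # check: explicit safety decision
            history.append(f"REFUSE: {step.tool} judged unsafe")
            break                            # refuse: stop instead of acting
        history.append(execute(step))        # act: only after the check passes
    return history
```

In this sketch the safety check sits between planning and execution, so an unsafe intermediate step is refused before it can cause irreversible harm, matching the paper's framing of refusal as an explicit action rather than a fallback.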

Original Abstract

Agentic language models operate in a fundamentally different safety regime than chat models: they must plan, call tools, and execute long-horizon actions where a single misstep, such as accessing files or entering credentials, can cause irreversible harm. Existing alignment methods, largely optimized for static generation and task completion, break down in these settings due to sequential decision-making, adversarial tool feedback, and overconfident intermediate reasoning. We introduce MOSAIC, a post-training framework that aligns agents for safe multi-step tool use by making safety decisions explicit and learnable. MOSAIC structures inference as a plan, check, then act or refuse loop, with explicit safety reasoning and refusal as first-class actions. To train without trajectory-level labels, we use preference-based reinforcement learning with pairwise trajectory comparisons, which captures safety distinctions often missed by scalar rewards. We evaluate MOSAIC zero-shot across three model families, Qwen2.5-7B, Qwen3-4B-Thinking, and Phi-4, and across out-of-distribution benchmarks spanning harmful tasks, prompt injection, benign tool use, and cross-domain privacy leakage. MOSAIC reduces harmful behavior by up to 50%, increases harmful-task refusal by over 20% on injection attacks, cuts privacy leakage, and preserves or improves benign task performance, demonstrating robust generalization across models, domains, and agentic settings.
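The abstract's preference-based training compares pairs of trajectories rather than scoring each one with a scalar reward. A common way to turn a pairwise comparison into a training objective is a Bradley-Terry-style negative log-likelihood; the sketch below is an illustrative assumption about that general form, not the paper's actual loss.

```python
# Bradley-Terry pairwise preference loss: the probability that the
# preferred (safer) trajectory outranks the rejected one is
# sigmoid(score_preferred - score_rejected); we minimize its -log.
import math


def pairwise_preference_loss(score_preferred: float, score_rejected: float) -> float:
    """Negative log-likelihood of ranking the preferred trajectory first."""
    margin = score_preferred - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

The loss is log 2 when the two trajectories score equally and shrinks as the preferred trajectory's score pulls ahead, so minimizing it widens the margin between safe and unsafe trajectories, capturing relative safety distinctions that a single scalar reward can miss.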

Tags

AI Agent, Safety, Reinforcement Learning, Tool Use, Alignment

arXiv Category

cs.CL