AI Agents relevance: 10/10

ATBench: A Diverse and Realistic Trajectory Benchmark for Long-Horizon Agent Safety

Yu Li, Haoyu Luo, Yuejin Xie, Yuqian Fu, Zhonghao Yang, Shuai Shao, Qihan Ren, Wanying Qu, Yanwei Fu, Yujiu Yang, Jing Shao, Xia Hu, Dongrui Liu
arXiv: 2604.02022v1 Published: 2026-04-02 Updated: 2026-04-02

AI Summary

ATBench is a trajectory-level benchmark for evaluating the safety of LLM agents, featuring both interaction diversity and long-horizon realism.

Key Contributions

  • Built a multi-dimensional taxonomy of agent risk covering risk sources, failure modes, and real-world harms
  • Proposed a long-context delayed-trigger protocol that simulates how risks emerge in realistic deployments
  • Created the ATBench dataset of 1,000 trajectories for evaluating LLM agent safety

Methodology

Construct a diverse tool pool and a long-context delayed-trigger protocol to generate a dataset containing both safe and unsafe trajectories; data quality is ensured through rule-based filtering, LLM-based filtering, and full human review.

Original Abstract

Evaluating the safety of LLM-based agents is increasingly important because risks in realistic deployments often emerge over multi-step interactions rather than isolated prompts or final responses. Existing trajectory-level benchmarks remain limited by insufficient interaction diversity, coarse observability of safety failures, and weak long-horizon realism. We introduce ATBench, a trajectory-level benchmark for structured, diverse, and realistic evaluation of agent safety. ATBench organizes agentic risk along three dimensions: risk source, failure mode, and real-world harm. Based on this taxonomy, we construct trajectories with heterogeneous tool pools and a long-context delayed-trigger protocol that captures realistic risk emergence across multiple stages. The benchmark contains 1,000 trajectories (503 safe and 497 unsafe), averaging 9.01 turns and 3.95k tokens, with 1,954 invoked tools drawn from pools spanning 2,084 available tools. Data quality is supported by rule-based and LLM-based filtering plus full human audit. Experiments on frontier LLMs, open-source models, and specialized guard systems show that ATBench is challenging even for strong evaluators, while enabling taxonomy-stratified analysis, cross-benchmark comparison, and diagnosis of long-horizon failure patterns.

Tags

AI Agents Safety Benchmark LLM

arXiv Categories

cs.AI