AI Agents relevance: 9/10

When Users Change Their Mind: Evaluating Interruptible Agents in Long-Horizon Web Navigation

Henry Peng Zou, Chunyu Miao, Wei-Chieh Huang, Yankai Chen, Yue Zhou, Hanrong Zhang, Yaozu Wu, Liancheng Fang, Zhengyao Gu, Zhen Zhang, Kening Zheng, Fangxin Wang, Yi Nian, Shanghao Li, Wenzhe Fan, Langzhou He, Weizhi Zhang, Xue Liu, Philip S. Yu
arXiv: 2604.00892v1 Published: 2026-04-01 Updated: 2026-04-01

AI Summary

Studies the ability of LLM agents to handle user interruptions in long-horizon web navigation, and proposes the InterruptBench benchmark.

Key Contributions

  • Formalizes three realistic interruption types: addition, revision, and retraction
  • Builds the InterruptBench benchmark to evaluate agents' interruption handling in long-horizon web navigation
  • Analyzes the effectiveness and efficiency of existing LLMs on interruption-handling tasks

Methodology

Builds the InterruptBench benchmark on top of WebArena-Lite and, via a unified interruption simulation framework, evaluates different LLMs in both single-turn and multi-turn interruption scenarios.
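The interruption simulation described above can be sketched as a simple injection loop. This is an illustrative toy sketch, not the paper's actual code: the `InterruptionType` enum mirrors the three interruption types the paper formalizes, while `Interruption`, `run_with_interruptions`, and the agent-step callback are hypothetical names invented here.

```python
from dataclasses import dataclass
from enum import Enum

class InterruptionType(Enum):
    ADDITION = "addition"      # user adds a new requirement mid-task
    REVISION = "revision"      # user revises an existing goal
    RETRACTION = "retraction"  # user withdraws part of the task

@dataclass
class Interruption:
    kind: InterruptionType
    step: int        # agent step at which the interruption is injected
    message: str     # updated user intent, in natural language

def run_with_interruptions(agent_step, initial_goal, interruptions, max_steps=30):
    """Drive a hypothetical agent loop, injecting user interruptions mid-task.

    `agent_step(goal, trace)` is assumed to return the next action string,
    or "STOP" when the agent believes the (possibly updated) goal is done.
    """
    goal = initial_goal
    pending = sorted(interruptions, key=lambda i: i.step)
    trace = []
    for step in range(max_steps):
        # Inject any interruption scheduled for this step before the agent acts,
        # appending the updated intent to the goal the agent conditions on.
        while pending and pending[0].step == step:
            goal = f"{goal}\n[USER UPDATE: {pending.pop(0).message}]"
        action = agent_step(goal, trace)
        trace.append(action)
        if action == "STOP":
            break
    return trace
```

Effectiveness can then be scored by whether the final state satisfies the updated intent, and efficiency by how many steps the agent spends recovering after the injection point.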

Original Abstract

As LLM agents transition from short, static problem solving to executing complex, long-horizon tasks in dynamic environments, the ability to handle user interruptions, such as adding requirements or revising goals during mid-task execution, is becoming a core requirement for realistic deployment. However, existing benchmarks largely assume uninterrupted agent behavior or study interruptions only in short, unconstrained language tasks. In this paper, we present the first systematic study of interruptible agents in long-horizon, environmentally grounded web navigation tasks, where actions induce persistent state changes. We formalize three realistic interruption types, including addition, revision, and retraction, and introduce InterruptBench, a benchmark derived from WebArena-Lite that synthesizes high-quality interruption scenarios under strict semantic constraints. Using a unified interruption simulation framework, we evaluate six strong LLM backbones across single- and multi-turn interruption settings, analyzing both their effectiveness in adapting to updated intents and their efficiency in recovering from mid-task changes. Our results show that handling user interruptions effectively and efficiently during long-horizon agentic tasks remains challenging for powerful large-scale LLMs. Code and dataset are available at https://github.com/HenryPengZou/InterruptBench.

Tags

LLM Agents Web Navigation Interruptions Benchmarking

arXiv Categories

cs.CL