Agentic Test-Time Scaling for WebAgents
AI Summary
For web agents, this work proposes CATTS, a confidence-based dynamic compute-allocation method that improves both efficiency and performance.
Key Contributions
- Shows that uniformly increasing compute yields diminishing returns on long-horizon tasks
- Proposes uncertainty statistics derived from the agent's vote distribution
- Introduces the Confidence-Aware Test-Time Scaling (CATTS) strategy
Methodology
An empirical study of inference-time scaling for web agents; uncertainty metrics derived from the agent's vote distribution are used to allocate compute dynamically.
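The core idea can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, thresholds, and extra-sample budget are all hypothetical choices; only the two uncertainty statistics (vote entropy and top-1/top-2 margin) come from the paper.

```python
from collections import Counter
import math

def vote_uncertainty(votes):
    """Compute entropy and top-1/top-2 margin of an action-vote distribution."""
    counts = Counter(votes)
    total = len(votes)
    probs = [c / total for c in counts.values()]
    entropy = -sum(p * math.log2(p) for p in probs)
    ranked = sorted(probs, reverse=True)
    margin = ranked[0] - (ranked[1] if len(ranked) > 1 else 0.0)
    return entropy, margin

def allocate_extra_samples(votes, entropy_thresh=1.0, margin_thresh=0.3, extra=8):
    """Spend extra compute only when the step is genuinely contentious.

    Thresholds and the extra-sample count are illustrative placeholders.
    """
    entropy, margin = vote_uncertainty(votes)
    contentious = entropy > entropy_thresh or margin < margin_thresh
    return extra if contentious else 0

# Unanimous votes: low entropy, large margin -> no extra compute.
print(allocate_extra_samples(["click_a"] * 4))   # -> 0
# Split votes: high entropy, small margin -> allocate extra samples.
print(allocate_extra_samples(["click_a", "click_b", "scroll", "click_a"]))  # -> 8
```

In a full agent loop, the returned budget would trigger additional rollouts (or an arbiter call) for that step only, which is what yields the token savings over uniform scaling.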
Original Abstract
Test-time scaling has become a standard way to improve the performance and reliability of neural network models. However, its behavior on agentic, multi-step tasks remains less well understood: small per-step errors can compound over long horizons, and we find that naive policies that uniformly increase sampling show diminishing returns. In this work, we present CATTS, a simple technique for dynamically allocating compute for multi-step agents. We first conduct an empirical study of inference-time scaling for web agents and find that uniformly increasing per-step compute quickly saturates in long-horizon environments. We then investigate stronger aggregation strategies, including an LLM-based Arbiter that can outperform naive voting but can also overrule high-consensus decisions. We show that uncertainty statistics derived from the agent's own vote distribution (entropy and top-1/top-2 margin) correlate with downstream success and provide a practical signal for dynamic compute allocation. Based on these findings, we introduce Confidence-Aware Test-Time Scaling (CATTS), which uses vote-derived uncertainty to allocate compute only when decisions are genuinely contentious. CATTS improves performance on WebArena-Lite and GoBrowse by up to 9.1% over ReAct while using up to 2.3x fewer tokens than uniform scaling, providing both efficiency gains and an interpretable decision rule.