AI Agents relevance: 9/10

CUAAudit: Meta-Evaluation of Vision-Language Models as Auditors of Autonomous Computer-Use Agents

Marta Sumyk, Oleksandr Kosovan
arXiv: 2603.10577v1 · Published: 2026-03-11 · Updated: 2026-03-11

AI Summary

This paper evaluates the ability of vision-language models to serve as auditors of autonomous computer-use agents and reveals their limitations.

Main Contributions

  • Evaluated the capability of VLMs as auditors of CUAs
  • Analyzed the performance of VLM auditors across different environments
  • Identified limitations of existing model-based auditing approaches

Methodology

A large-scale evaluation of five VLMs on three CUA benchmarks, analyzing their accuracy, confidence calibration, and inter-model agreement.
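The three evaluation dimensions named above (accuracy, calibration, inter-model agreement) can be sketched with standard metric definitions. The concrete choices below — expected calibration error with 10 equal-width bins and pairwise Cohen's kappa for agreement — are assumptions for illustration; the digest does not specify which formulations the paper uses.

```python
def accuracy(preds, labels):
    """Fraction of auditor verdicts matching ground-truth task success."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def expected_calibration_error(confs, preds, labels, n_bins=10):
    """ECE: bin judgments by stated confidence, then average the gap
    between per-bin accuracy and per-bin mean confidence, weighted
    by bin size. 0 = perfectly calibrated."""
    bins = [[] for _ in range(n_bins)]
    for c, p, y in zip(confs, preds, labels):
        idx = min(int(c * n_bins), n_bins - 1)  # clamp c == 1.0 into last bin
        bins[idx].append((c, p == y))
    ece = 0.0
    for b in bins:
        if not b:
            continue
        mean_conf = sum(c for c, _ in b) / len(b)
        bin_acc = sum(ok for _, ok in b) / len(b)
        ece += len(b) / len(labels) * abs(bin_acc - mean_conf)
    return ece


def cohens_kappa(a, b):
    """Chance-corrected agreement between two binary (0/1) auditors."""
    n = len(a)
    p_obs = sum(x == y for x, y in zip(a, b)) / n
    pa, pb = sum(a) / n, sum(b) / n
    p_exp = pa * pb + (1 - pa) * (1 - pb)
    return 1.0 if p_exp == 1 else (p_obs - p_exp) / (1 - p_exp)


# Toy example: two auditors judging four episodes (1 = task succeeded).
labels = [1, 0, 0, 0]
auditor_a = [1, 1, 0, 0]
print(accuracy(auditor_a, labels))          # 0.75
print(cohens_kappa(auditor_a, labels))      # 0.5
```

Pairwise kappa over all five auditors would give the inter-model agreement matrix; averaging ECE over benchmarks separates calibration from raw accuracy.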

Original Abstract

Computer-Use Agents (CUAs) are emerging as a new paradigm in human-computer interaction, enabling autonomous execution of tasks in desktop environments by interpreting high-level natural-language instructions. As such agents become increasingly capable and are deployed across diverse desktop environments, evaluating their behavior in a scalable and reliable manner becomes a critical challenge. Existing evaluation pipelines rely on static benchmarks, rule-based success checks, or manual inspection, which are brittle, costly, and poorly aligned with real-world usage. In this work, we study Vision-Language Models (VLMs) as autonomous auditors for assessing CUA task completion directly from observable interactions and conduct a large-scale meta-evaluation of five VLMs that judge task success given a natural-language instruction and the final environment state. Our evaluation spans three widely used CUA benchmarks across macOS, Windows, and Linux environments and analyzes auditor behavior along three complementary dimensions: accuracy, calibration of confidence estimates, and inter-model agreement. We find that while state-of-the-art VLMs achieve strong accuracy and calibration, all auditors exhibit notable performance degradation in more complex or heterogeneous environments, and even high-performing models show significant disagreement in their judgments. These results expose fundamental limitations of current model-based auditing approaches and highlight the need to explicitly account for evaluator reliability, uncertainty, and variance when deploying autonomous CUAs in real-world settings.

Tags

Vision-Language Models · Computer-Use Agents · Auditing · Meta-Evaluation

arXiv Categories

cs.AI · cs.HC