AI Agents relevance: 9/10

AgentVista: Evaluating Multimodal Agents in Ultra-Challenging Realistic Visual Scenarios

Zhaochen Su, Jincheng Gao, Hangyu Guo, Zhenhua Liu, Lueyang Zhang, Xinyu Geng, Shijue Huang, Peng Xia, Guanyu Jiang, Cheng Wang, Yue Zhang, Yi R. Fung, Junxian He
arXiv: 2602.23166v1 Published: 2026-02-26 Updated: 2026-02-26

AI Summary

The AgentVista benchmark evaluates the tool-use capabilities of multimodal agents through complex, realistic visual scenarios.

Key Contributions

  • Introduces the AgentVista benchmark, covering 25 sub-domains across 7 categories.
  • Pairs realistic scenarios with natural hybrid tool use.
  • Evaluates existing models' long-horizon multimodal tool-use abilities and exposes significant gaps.

Methodology

Constructs realistic scenarios rich in visual detail; tasks require cross-modal, long-horizon tool interactions such as web search, image search, page navigation, and code-based operations.
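To make the long-horizon tool-interaction pattern concrete, here is a minimal sketch of an agent-side tool-dispatch loop. The tool names (`web_search`, `image_search`, `run_code`), the scripted plan, and the turn cap are hypothetical illustrations, not AgentVista's actual interface; the 25-turn limit echoes the abstract's note that hard instances can require more than 25 tool-calling turns.

```python
# Hypothetical sketch of a multi-turn tool-calling loop; the tool names
# and scripted plan are illustrative, not the benchmark's real API.

def web_search(query):
    # Stub: a real agent would call a search API here.
    return f"results for: {query}"

def image_search(image_id):
    # Stub: a real agent would retrieve visually similar images/pages.
    return f"matches for image: {image_id}"

def run_code(snippet):
    # Stub: a real agent would execute image-processing or general code.
    return f"executed: {snippet}"

TOOLS = {"web_search": web_search,
         "image_search": image_search,
         "run_code": run_code}

def run_agent(plan, max_turns=25):
    """Dispatch a scripted sequence of (tool, argument) calls,
    mimicking the multi-turn interactions hard instances demand."""
    transcript = []
    for turn, (tool, arg) in enumerate(plan):
        if turn >= max_turns:
            break
        transcript.append((tool, TOOLS[tool](arg)))
    return transcript

# Example: the device-troubleshooting workflow from the abstract,
# expressed as a scripted tool plan.
plan = [("image_search", "wiring_photo"),
        ("web_search", "device schematic manual"),
        ("run_code", "crop_and_enhance(wiring_photo)")]
trace = run_agent(plan)
```

In a real agent the plan would be produced turn by turn by the model conditioned on prior tool outputs, rather than scripted in advance.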

Original Abstract

Real-world multimodal agents solve multi-step workflows grounded in visual evidence. For example, an agent can troubleshoot a device by linking a wiring photo to a schematic and validating the fix with online documentation, or plan a trip by interpreting a transit map and checking schedules under routing constraints. However, existing multimodal benchmarks mainly evaluate single-turn visual reasoning or specific tool skills, and they do not fully capture the realism, visual subtlety, and long-horizon tool use that practical agents require. We introduce AgentVista, a benchmark for generalist multimodal agents that spans 25 sub-domains across 7 categories, pairing realistic and detail-rich visual scenarios with natural hybrid tool use. Tasks require long-horizon tool interactions across modalities, including web search, image search, page navigation, and code-based operations for both image processing and general programming. Comprehensive evaluation of state-of-the-art models exposes significant gaps in their ability to carry out long-horizon multimodal tool use. Even the best model in our evaluation, Gemini-3-Pro with tools, achieves only 27.3% overall accuracy, and hard instances can require more than 25 tool-calling turns. We expect AgentVista to accelerate the development of more capable and reliable multimodal agents for realistic and ultra-challenging problem solving.

Tags

Multimodal Agents · Benchmarks · Tool Use · Long-Horizon

arXiv Category

cs.CV