VisBrowse-Bench: Benchmarking Visual-Native Search for Multimodal Browsing Agents
AI Summary
This paper introduces VisBrowse-Bench, a new visual-native search benchmark for evaluating the visual reasoning abilities of multimodal browsing agents.
Main Contributions
- Introduces the VisBrowse-Bench dataset, comprising 169 VQA instances
- Proposes an agent workflow that drives browsing agents to actively collect and reason over visual information
- Comprehensively evaluates open-source and closed-source models, finding substantial room for improvement in current performance
Methodology
The dataset is constructed by human experts through a multi-stage pipeline that cross-validates multimodal evidence via text-image retrieval and joint reasoning. An agent workflow is designed to drive the agent to actively collect and reason over visual information during search.
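The described workflow, gather text and image evidence from the web, then cross-validate the two modalities before answering, can be sketched as a minimal loop. All names here (`Evidence`, `cross_validate`, `browse_and_answer`, the token-overlap check standing in for MLLM joint reasoning) are hypothetical illustrations, not the paper's actual implementation:

```python
from dataclasses import dataclass

@dataclass
class Evidence:
    """One piece of retrieved web evidence (text snippet or image caption)."""
    source_url: str
    text: str = ""
    image_caption: str = ""  # stand-in for raw image pixels an MLLM would see

def cross_validate(text_ev: Evidence, image_ev: Evidence) -> bool:
    # Hypothetical consistency check: accept the evidence pair when the
    # text snippet and the image description share content words, as a
    # crude stand-in for joint multimodal reasoning by an MLLM.
    shared = set(text_ev.text.lower().split()) & set(image_ev.image_caption.lower().split())
    return len(shared) > 0

def browse_and_answer(question: str, search_fn, max_steps: int = 5):
    """Iteratively query the web until text and image evidence agree.

    `search_fn(question, step)` is an assumed retrieval hook returning a
    (text Evidence, image Evidence) pair for the current browsing step.
    """
    collected = []
    for step in range(max_steps):
        text_ev, image_ev = search_fn(question, step)
        if cross_validate(text_ev, image_ev):
            collected.append((text_ev, image_ev))
            break  # cross-validated evidence found; answer from it
    return collected
```

In a real browsing agent the retrieval hook would call a search API and the cross-validation step would be performed by the MLLM itself over actual page screenshots.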
Original Abstract
The rapid advancement of Multimodal Large Language Models (MLLMs) has enabled browsing agents to acquire and reason over multimodal information in the real world. However, existing benchmarks suffer from two limitations: insufficient evaluation of visual reasoning ability and the neglect of web pages' native visual information in reasoning chains. To address these challenges, we introduce a new benchmark for visual-native search, VisBrowse-Bench. It contains 169 VQA instances covering multiple domains and evaluates models' visual reasoning capabilities during the search process through multimodal evidence cross-validation via text-image retrieval and joint reasoning. These data were constructed by human experts using a multi-stage pipeline and underwent rigorous manual verification. We additionally propose an agent workflow that can effectively drive the browsing agent to actively collect and reason over visual information during the search process. We comprehensively evaluated both open-source and closed-source models in this workflow. Experimental results show that even the best-performing model, Claude-4.6-Opus, achieves an accuracy of only 47.6%, while the proprietary Deep Research model, o3-deep-research, achieves an accuracy of only 41.1%. The code and data can be accessed at: https://github.com/ZhengboZhang/VisBrowse-Bench