Multimodal Learning (Relevance: 9/10)

V-Retrver: Evidence-Driven Agentic Reasoning for Universal Multimodal Retrieval

Dongyang Chen, Chaoyang Wang, Dezhao SU, Xi Xiao, Zeyu Zhang, Jing Xiong, Qing Li, Yuzhang Shang, Shichao Ka
arXiv: 2602.06034v1 Published: 2026-02-05 Updated: 2026-02-05

AI Summary

V-Retrver improves the accuracy and reliability of universal multimodal retrieval through visual-evidence-driven agentic reasoning.

Key Contributions

  • Proposes the V-Retrver framework, which uses an agent to perform visual-evidence-driven reasoning
  • Introduces a curriculum learning strategy to train the evidence-gathering retrieval agent
  • Demonstrates experimentally that V-Retrver improves performance on multimodal retrieval tasks

Methodology

An MLLM serves as the agent: it selectively acquires visual evidence through external visual tools, performs multimodal interleaved reasoning that alternates hypothesis generation with targeted verification, and is trained with a curriculum learning strategy.
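The interleaved reasoning loop described above can be sketched roughly as follows. This is a minimal illustration only: `mllm_generate`, the tool registry, and the step dictionary format are hypothetical stand-ins, since the paper's actual interfaces are not given in this summary.

```python
def rerank_candidate(query, candidate_image, mllm_generate, visual_tools,
                     max_steps=4):
    """Score one retrieval candidate by alternating hypothesis
    generation with targeted visual verification (illustrative sketch).

    mllm_generate: callable that, given the context so far, returns either
        a tool call {"type": "tool_call", "name": ..., "args": {...}} or a
        final judgment {"type": "answer", "score": float}.
    visual_tools: mapping of tool name -> callable(image, **args), e.g. a
        region-cropping or zoom tool used to gather fine-grained evidence.
    """
    # Context starts with the query and a static view of the candidate.
    context = [("query", query), ("image", candidate_image)]
    for _ in range(max_steps):
        step = mllm_generate(context)
        if step["type"] == "tool_call":
            # The agent requests fine-grained visual evidence...
            tool = visual_tools[step["name"]]
            evidence = tool(candidate_image, **step["args"])
            # ...and the evidence is interleaved back into the reasoning.
            context.append(("evidence", evidence))
        else:
            # The agent is confident enough to emit a relevance score.
            return step["score"]
    return 0.0  # no verdict within the inspection budget
```

In a reranking setting, this scoring function would be applied to each top-k candidate and the candidates reordered by the returned scores.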

Original Abstract

Multimodal Large Language Models (MLLMs) have recently been applied to universal multimodal retrieval, where Chain-of-Thought (CoT) reasoning improves candidate reranking. However, existing approaches remain largely language-driven, relying on static visual encodings and lacking the ability to actively verify fine-grained visual evidence, which often leads to speculative reasoning in visually ambiguous cases. We propose V-Retrver, an evidence-driven retrieval framework that reformulates multimodal retrieval as an agentic reasoning process grounded in visual inspection. V-Retrver enables an MLLM to selectively acquire visual evidence during reasoning via external visual tools, performing a multimodal interleaved reasoning process that alternates between hypothesis generation and targeted visual verification. To train such an evidence-gathering retrieval agent, we adopt a curriculum-based learning strategy combining supervised reasoning activation, rejection-based refinement, and reinforcement learning with an evidence-aligned objective. Experiments across multiple multimodal retrieval benchmarks demonstrate consistent improvements in retrieval accuracy (with 23.0% improvements on average), perception-driven reasoning reliability, and generalization.
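The three-stage curriculum mentioned in the abstract (supervised reasoning activation, rejection-based refinement, RL with an evidence-aligned objective) might be organized along the following lines. The `agent` interface, data handling, and reward shaping here are illustrative assumptions, not the authors' implementation:

```python
def train_curriculum(agent, data, task_reward, evidence_score):
    """Sketch of a three-stage curriculum for an evidence-gathering
    retrieval agent (assumed structure, not the paper's code).

    agent: object with sft_step(trace), sample(example), rl_step(trace, r).
    task_reward / evidence_score: callables scoring a sampled trace.
    """
    # Stage 1: supervised reasoning activation, i.e. SFT on
    # reasoning traces so the agent learns the tool-use format.
    for trace in data:
        agent.sft_step(trace)

    # Stage 2: rejection-based refinement, keep only self-generated
    # traces whose final answer is correct, then fine-tune on them.
    accepted = [t for t in (agent.sample(x) for x in data)
                if t.answer_correct]
    for trace in accepted:
        agent.sft_step(trace)

    # Stage 3: reinforcement learning with an evidence-aligned
    # objective, reward combines task success with how well the
    # gathered evidence supports the final judgment.
    for x in data:
        trace = agent.sample(x)
        agent.rl_step(trace, task_reward(trace) + evidence_score(trace))
```

The key design idea carried over from the abstract is that later stages only ever train on the agent's own verified or reward-scored trajectories, so the evidence-gathering behavior activated in stage 1 is progressively sharpened rather than replaced.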

Tags

Multimodal Retrieval, Visual Reasoning, Agent, MLLM, Evidence-Driven

arXiv Category

cs.CV