Multimodal Learning Relevance: 9/10

Evo-Retriever: LLM-Guided Curriculum Evolution with Viewpoint-Pathway Collaboration for Multimodal Document Retrieval

Weiqing Li, Jinyue Guo, Yaqi Wang, Haiyang Xiao, Yuewei Zhang, Guohua Liu, Hao Henry Wang
arXiv: 2603.16455v1 Published: 2026-03-17 Updated: 2026-03-17

AI Summary

Evo-Retriever uses LLM-guided curriculum evolution, built on multi-viewpoint and pathway collaboration, to improve multimodal document retrieval performance.

Key Contributions

  • Proposes Evo-Retriever, a retrieval framework based on LLM-guided curriculum evolution
  • Designs a multi-view image alignment method to strengthen fine-grained matching
  • Introduces a bidirectional contrastive learning strategy that generates hard queries and establishes complementary learning paths
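The multi-view alignment above feeds a late-interaction scorer of the kind the abstract describes ("multi-vector representations"): each image view contributes its own vector, and relevance is aggregated ColBERT-style with MaxSim. A minimal NumPy sketch; the function name is an illustrative assumption, not the paper's API:

```python
import numpy as np

def maxsim_score(query_vecs: np.ndarray, view_vecs: np.ndarray) -> float:
    """Late-interaction relevance (ColBERT-style MaxSim): for each query
    vector, take its best cosine similarity over all document-view
    vectors, then sum over query vectors."""
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    v = view_vecs / np.linalg.norm(view_vecs, axis=1, keepdims=True)
    sim = q @ v.T                  # (n_query_vecs, n_views) cosine matrix
    return float(sim.max(axis=1).sum())
```

Documents are then ranked by this score, so a document only needs one well-matching view (scale or crop) per query vector to surface.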

Methodology

The method combines multi-view image alignment, bidirectional contrastive learning, and an LLM meta-controller that adaptively adjusts the training curriculum.
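The meta-controller loop can be sketched as a function from a model-state summary to an updated curriculum. In the paper an LLM makes this decision using expert knowledge; the rule below is a deterministic stand-in, and the `Curriculum` fields are illustrative assumptions:

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Curriculum:
    hard_negative_ratio: float  # fraction of generated "hard queries" per batch (assumed knob)
    num_views: int              # image views fed to multi-view alignment (assumed knob)

def meta_controller_step(summary: dict, cur: Curriculum) -> Curriculum:
    """Stand-in for the LLM meta-controller: the real system feeds the
    model-state summary to an LLM that edits the curriculum; a fixed rule
    mimics one such decision here."""
    if summary.get("loss_trend") == "plateau":
        # When learning stalls, raise the share of hard negatives.
        return replace(cur, hard_negative_ratio=min(1.0, cur.hard_negative_ratio + 0.1))
    return cur
```

Keeping the curriculum immutable and returning a new instance per step makes each adjustment auditable, which matters when the controller is a nondeterministic LLM.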

Original Abstract

Visual-language models (VLMs) excel at data mappings, but real-world document heterogeneity and unstructuredness disrupt the consistency of cross-modal embeddings. Recent late-interaction methods enhance image-text alignment through multi-vector representations, yet traditional training with limited samples and static strategies cannot adapt to the model's dynamic evolution, causing cross-modal retrieval confusion. To overcome this, we introduce Evo-Retriever, a retrieval framework featuring an LLM-guided curriculum evolution built upon a novel Viewpoint-Pathway collaboration. First, we employ multi-view image alignment to enhance fine-grained matching via multi-scale and multi-directional perspectives. Then, a bidirectional contrastive learning strategy generates "hard queries" and establishes complementary learning paths for visual and textual disambiguation to rebalance supervision. Finally, the model-state summary from the above collaboration is fed into an LLM meta-controller, which adaptively adjusts the training curriculum using expert knowledge to promote the model's evolution. On ViDoRe V2 and MMEB (VisDoc), Evo-Retriever achieves state-of-the-art performance, with nDCG@5 scores of 65.2% and 77.1%.
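The "bidirectional contrastive learning" in the abstract can be read as a symmetric InfoNCE over in-batch image-text pairs, applied in both retrieval directions. A minimal NumPy sketch; the function name and temperature value are illustrative assumptions, not the paper's exact loss:

```python
import numpy as np

def bidirectional_info_nce(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over in-batch pairs: cross-entropy for
    image->text retrieval plus text->image retrieval, averaged."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature       # row i should peak at column i

    def xent(l):
        l = l - l.max(axis=1, keepdims=True)              # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp.diagonal().mean()                    # matched pairs sit on the diagonal

    return 0.5 * (xent(logits) + xent(logits.T))
```

Averaging the two directions rebalances supervision between the visual and textual sides; the paper's generated "hard queries" would enter as extra rows on the text side.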

Tags

Multimodal Learning · Document Retrieval · Vision-Language Models · Curriculum Learning

arXiv Category

cs.CV