Multimodal Learning Relevance: 9/10

ARK: A Dual-Axis Multimodal Retrieval Benchmark along Reasoning and Knowledge

Yijie Lin, Guofeng Ding, Haochen Zhou, Haobin Li, Mouxing Yang, Xi Peng
arXiv: 2602.09839v1 Published: 2026-02-10 Updated: 2026-02-10

AI Summary

Introduces the ARK benchmark for evaluating multimodal retrieval along knowledge and reasoning dimensions, and analyzes where existing models fall short.

Key Contributions

  • Introduces the ARK benchmark dataset, organized along two axes: knowledge domains and reasoning skills
  • Analyzes the gap existing models exhibit between knowledge-intensive and reasoning-intensive retrieval
  • Evaluates a broad set of retrievers and examines simple enhancements (re-ranking and query rewriting)

Methodology

Constructs a retrieval dataset with multimodal queries and candidates, and pairs queries with targeted hard negatives designed to probe models' knowledge and reasoning abilities.
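As a rough illustration of how such hard-negative evaluation can be scored, the sketch below checks whether the gold candidate outranks its targeted hard negatives under cosine similarity. It is a minimal, hypothetical example: the encoder functions and data fields are placeholders, not the paper's actual pipeline or metrics.

```python
import numpy as np

# Hypothetical sketch: score one ARK-style query against its gold candidate
# and its targeted hard negatives. embed_query / embed_candidate are
# placeholders for whatever text or multimodal retriever is being evaluated.

def embed_query(query_text: str, query_image: np.ndarray | None) -> np.ndarray:
    """Placeholder multimodal encoder for the query side."""
    raise NotImplementedError("plug in a real retriever here")

def embed_candidate(cand_text: str, cand_image: np.ndarray | None) -> np.ndarray:
    """Placeholder multimodal encoder for the candidate side."""
    raise NotImplementedError("plug in a real retriever here")

def recall_at_1(query_emb: np.ndarray, gold_emb: np.ndarray,
                negative_embs: list[np.ndarray]) -> float:
    """Return 1.0 if the gold candidate outranks every hard negative."""
    def cos(a: np.ndarray, b: np.ndarray) -> float:
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))
    gold_score = cos(query_emb, gold_emb)
    return 1.0 if all(gold_score > cos(query_emb, n) for n in negative_embs) else 0.0
```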

Original Abstract

Existing multimodal retrieval benchmarks largely emphasize semantic matching on daily-life images and offer limited diagnostics of professional knowledge and complex reasoning. To address this gap, we introduce ARK, a benchmark designed to analyze multimodal retrieval from two complementary perspectives: (i) knowledge domains (five domains with 17 subtypes), which characterize the content and expertise retrieval relies on, and (ii) reasoning skills (six categories), which characterize the type of inference over multimodal evidence required to identify the correct candidate. Specifically, ARK evaluates retrieval with both unimodal and multimodal queries and candidates, covering 16 heterogeneous visual data types. To avoid shortcut matching during evaluation, most queries are paired with targeted hard negatives that require multi-step reasoning. We evaluate 23 representative text-based and multimodal retrievers on ARK and observe a pronounced gap between knowledge-intensive and reasoning-intensive retrieval, with fine-grained visual and spatial reasoning emerging as persistent bottlenecks. We further show that simple enhancements such as re-ranking and rewriting yield consistent improvements, but substantial headroom remains.
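The abstract notes that simple enhancements such as re-ranking and query rewriting yield consistent improvements. The sketch below shows one generic way such a two-stage pipeline can be wired together; the rewriter, base retriever, and re-ranking scorer are all assumed placeholders, not the specific methods used in the paper.

```python
from typing import Callable, Sequence

def retrieve_with_enhancements(
    query: str,
    candidates: Sequence[str],
    rewrite_query: Callable[[str], str],                         # e.g. an LLM-based rewriter (assumed)
    base_retriever: Callable[[str, Sequence[str]], list[int]],   # returns candidate indices, best first (assumed)
    cross_scorer: Callable[[str, str], float],                   # stronger query-candidate scorer for re-ranking (assumed)
    top_k: int = 20,
) -> list[int]:
    """Rewrite the query, retrieve a top-k shortlist, then re-rank it with a stronger scorer."""
    rewritten = rewrite_query(query)
    shortlist = base_retriever(rewritten, candidates)[:top_k]
    return sorted(shortlist, key=lambda i: cross_scorer(rewritten, candidates[i]), reverse=True)
```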

Tags

Multimodal Retrieval · Benchmark Dataset · Knowledge & Reasoning · Vision-Language

arXiv Category

cs.CV