Multimodal Learning Relevance: 9/10

TRACE: Task-Adaptive Reasoning and Representation Learning for Universal Multimodal Retrieval

Xiangzhao Hao, Shijie Wang, Tianyu Yang, Tianyue Wang, Haiyun Guo, Jinqiao Wang
arXiv: 2603.02929v1 Published: 2026-03-03 Updated: 2026-03-03

AI Summary

TRACE improves universal multimodal retrieval by unifying generative reasoning with discriminative representation learning, achieving task-adaptive behavior.

Main Contributions

  • Proposes the TRACE framework, unifying generative reasoning with discriminative representation learning
  • Constructs the M-BEIR-CoT dataset for training the reasoning model
  • Enables autonomous reasoning for complex queries and fast retrieval for simple ones

Methodology

TRACE first generates a Chain-of-Thought (CoT) to reason about the query, then compresses the reasoning trace into a compact embedding used for discriminative retrieval.
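The reason-then-compress pipeline can be sketched in pure Python. This is a hypothetical illustration only: the `generate_cot` and `compress_to_embedding` helpers, the `<emb>` token string, and the hash-based vectors are stand-ins for what the paper implements with an MLLM's autoregressive decoding and hidden states.

```python
import hashlib
import math

EMB_DIM = 16  # toy embedding size; the real model would use the MLLM hidden size

def generate_cot(query: str) -> str:
    """Stand-in for generative reasoning: produce a structured trace.
    A real MLLM would autoregressively decode this Chain-of-Thought."""
    return f"Step 1: parse intent of '{query}'. Step 2: identify target attributes."

def compress_to_embedding(query: str, cot: str) -> list[float]:
    """Stand-in for the dedicated compression token: hash the query plus
    its reasoning trace into a fixed-size, L2-normalized vector."""
    vec = [0.0] * EMB_DIM
    for tok in f"{query} {cot} <emb>".split():
        h = int(hashlib.md5(tok.encode()).hexdigest(), 16)
        vec[h % EMB_DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def retrieve(query: str, corpus: dict[str, list[float]]) -> str:
    """Discriminative retrieval: cosine similarity over compact embeddings
    (vectors are already normalized, so the dot product suffices)."""
    q = compress_to_embedding(query, generate_cot(query))
    return max(corpus, key=lambda doc: sum(a * b for a, b in zip(q, corpus[doc])))
```

The key design point mirrored here is that retrieval operates on a single compact vector per query, so the generative reasoning cost is paid once at encoding time rather than per candidate document.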

Original Abstract

Universal Multimodal Retrieval requires unified embedding models capable of interpreting diverse user intents, ranging from simple keywords to complex compositional instructions. While Multimodal Large Language Models (MLLMs) possess strong reasoning capabilities, prevailing adaptations confine them to static encoders, underutilizing their generative potential. This encoder-only paradigm struggles with complex intents that demand logical deduction rather than superficial pattern matching. To address this, we introduce TRACE (Task-adaptive Reasoning And Compressing Embeddings). TRACE unifies generative reasoning with discriminative representation learning. It first generates a structured Chain-of-Thought (CoT) to explicitly reason about the query, and subsequently compresses this reasoning trace into a compact embedding via a dedicated token. To train this framework, we construct M-BEIR-CoT, a large-scale dataset featuring a difficulty-aware routing strategy. Experiments on the M-BEIR benchmark establish TRACE as the new state-of-the-art. Crucially, TRACE demonstrates a learned implicit routing behavior. It autonomously activates reasoning for complex queries while bypassing it for simpler ones, achieving an optimal balance between retrieval accuracy and inference throughput. Furthermore, by internalizing the deductive process, TRACE exhibits remarkable zero-shot transferability to unseen domains and novel constraints.
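The implicit routing behavior described in the abstract — activating reasoning only for complex queries and bypassing it for simple ones — can be sketched with an explicit heuristic router. Note this is a hand-coded approximation for illustration; in TRACE the routing is learned implicitly from the difficulty-aware training data, and the marker list and length threshold below are invented.

```python
def is_complex(query: str) -> bool:
    """Hypothetical difficulty heuristic: treat long or compositional
    queries as requiring reasoning. TRACE learns this routing implicitly."""
    compositional_markers = ("but", "except", "without", "instead of")
    return len(query.split()) > 8 or any(m in query.lower() for m in compositional_markers)

def encode(query: str) -> str:
    """Route: reason first on complex queries, embed directly on simple ones."""
    if is_complex(query):
        cot = f"[CoT] decompose '{query}' into constraints"  # slow path: generate first
        return f"embed({cot})"
    return f"embed({query})"  # fast path: skip generation for throughput
```

This captures the accuracy/throughput trade-off the abstract highlights: the expensive generative step runs only when the query plausibly needs logical deduction.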

Tags

Multimodal Retrieval Chain-of-Thought Representation Learning Task-Adaptive

arXiv Categories

cs.CV