Multimodal Learning Relevance: 9/10

PLUME: Latent Reasoning Based Universal Multimodal Embedding

Chenwei He, Xiangzhao Hao, Tianyu Yang, Yuxiang Ma, Yuheng Jia, Lingxiang Wu, Chaoyang Zhao, Haiyun Guo, Jinqiao Wang
arXiv: 2604.02073v1 Published: 2026-04-02 Updated: 2026-04-02

AI Summary

PLUME proposes a universal multimodal embedding framework based on latent reasoning that improves inference efficiency.

Main Contributions

  • Proposes the PLUME framework, replacing explicit CoT with latent reasoning.
  • Introduces a semantic-anchor-guided transition adapter that enables diverse reasoning trajectories.
  • Adopts a progressive explicit-to-latent curriculum to stabilize training.
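The progressive explicit-to-latent curriculum can be pictured as a loss-weight schedule that starts with full verbalized-CoT supervision and anneals it to zero, leaving only latent computation at the end. A minimal sketch, assuming a simple linear anneal; the schedule shape and the `decay_start`/`decay_end` parameters are hypothetical and not taken from the paper:

```python
def explicit_cot_weight(step: int, total_steps: int,
                        decay_start: float = 0.2,
                        decay_end: float = 0.8) -> float:
    """Weight on the explicit-CoT loss term at a given training step.

    Starts at 1.0 (verbalized reasoning as a scaffold), decays linearly
    to 0.0 (pure latent reasoning), as an illustrative curriculum.
    """
    t = step / total_steps
    if t <= decay_start:
        return 1.0
    if t >= decay_end:
        return 0.0
    return 1.0 - (t - decay_start) / (decay_end - decay_start)


# Total loss at each step would then blend the two objectives, e.g.:
#   loss = w * cot_loss + (1 - w) * latent_loss
for step in (0, 50, 100):
    print(step, explicit_cot_weight(step, total_steps=100))
```

By the end of training the CoT term vanishes entirely, which is what allows inference to skip rationale generation.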

Methodology

Replaces explicit CoT with an autoregressive rollout of continuous latent states, steers the reasoning direction via semantic anchors, and trains with a progressive curriculum.
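The rollout above can be sketched in a few lines. This is an illustrative stand-in, not the authors' implementation: the linear transition `W`, the anchor table, the cosine-similarity anchor selection, and all dimensions are assumptions; a real system would run a trained MLLM backbone with a learned transition adapter.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16       # latent state dimension (assumed)
K = 4        # number of semantic anchors (assumed)
STEPS = 8    # fixed rollout budget: "fewer than 10 latent steps"

W = rng.standard_normal((D, D)) * 0.1   # stand-in transition weights
anchors = rng.standard_normal((K, D))   # stand-in semantic anchor vectors


def select_anchor(query_state: np.ndarray) -> np.ndarray:
    """Pick the anchor most similar to the query state (cosine similarity)."""
    sims = anchors @ query_state / (
        np.linalg.norm(anchors, axis=1) * np.linalg.norm(query_state) + 1e-8
    )
    return anchors[int(np.argmax(sims))]


def latent_rollout(h0: np.ndarray) -> np.ndarray:
    """Autoregressively unroll continuous latent states; no tokens generated."""
    anchor = select_anchor(h0)
    h = h0
    for _ in range(STEPS):
        # The anchor conditions every transition, steering the trajectory
        # while the computation budget (STEPS) stays fixed.
        h = np.tanh(W @ h + anchor)
    return h  # this final state would then be pooled into the embedding


h0 = rng.standard_normal(D)   # stand-in for the encoder's initial query state
emb = latent_rollout(h0)
print(emb.shape)              # (16,)
```

Note the contrast with explicit CoT: nothing is decoded into text during the rollout, so the cost is a fixed, small number of forward steps regardless of how verbose the equivalent rationale would be.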

Original Abstract

Universal multimodal embedding (UME) maps heterogeneous inputs into a shared retrieval space with a single model. Recent approaches improve UME by generating explicit chain-of-thought (CoT) rationales before extracting embeddings, enabling multimodal large language models to better infer complex query intent. However, explicit CoT incurs substantial inference overhead and can compress rich multimodal evidence into a narrow textual bottleneck. We propose PLUME, a latent reasoning framework that advances UME by replacing verbalized CoT with a short autoregressive rollout of continuous latent states. To support diverse multimodal queries, PLUME further introduces a semantic-anchor-guided transition adapter that steers latent rollout along different reasoning trajectories under the same fixed computation budget. To stabilize training, PLUME adopts a progressive explicit-to-latent curriculum that uses verbalized reasoning only as a temporary training scaffold and gradually transfers this behavior into hidden-state computation, eliminating explicit CoT at inference. On the 78-task MMEB-v2 benchmark, PLUME outperforms strong explicit-CoT UME baselines while reducing reasoning from hundreds of generated tokens to fewer than 10 latent steps, delivering over 30x faster inference. PLUME is especially well suited to retrieval settings where relevant evidence is dense, structurally complex, and difficult to organize through verbalized intermediate rationales, such as video and visual document retrieval. These results show that structured latent computation can preserve the benefits of intermediate reasoning without the overhead of explicit rationale generation, providing a stronger and more efficient paradigm for practical retrieval systems.
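The "over 30x faster inference" claim is consistent with a simple per-step cost comparison. A back-of-the-envelope check, assuming a representative CoT length of 300 tokens (the abstract says only "hundreds") and one forward pass per generated token or latent step:

```python
cot_tokens = 300    # assumed midpoint of "hundreds of generated tokens"
latent_steps = 9    # "fewer than 10 latent steps"

# If each generated token and each latent step cost roughly one forward
# pass, the reasoning-phase speedup is the ratio of step counts.
speedup = cot_tokens / latent_steps
print(f"{speedup:.1f}x")   # prints 33.3x
```

A ratio above 30x under these rough assumptions matches the reported figure; the exact number depends on the actual CoT length per query.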

Tags

Multimodal Learning, Universal Multimodal Embedding, Latent Reasoning, Retrieval

arXiv Categories

cs.CV