Multimodal Learning · Relevance: 9/10

Beyond Sequential Distance: Inter-Modal Distance Invariant Position Encoding

Lin Chen, Bolin Ni, Qi Yang, Zili Wang, Kun Ding, Ying Wang, Houwen Peng, Shiming Xiang
arXiv: 2603.10863v1 · Published: 2026-03-11 · Updated: 2026-03-11

AI Summary

Proposes inter-modal Distance Invariant Position Encoding (DIPE), which mitigates the visual-fading problem of MLLMs in long-context scenarios.

Key Contributions

  • Proposes inter-modal Distance Invariant Position Encoding (DIPE)
  • Mitigates visual fading in long-context scenarios
  • Validates effectiveness in both long- and short-context settings

Methodology

DIPE retains the natural relative position encoding for intra-modal interactions while introducing an anchored perceptual proximity for inter-modal interactions, thereby disentangling position encoding according to modality interaction.
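The core idea above can be sketched as a modality-aware relative-position matrix: token pairs within the same modality keep their ordinary RoPE-style offset, while cross-modal pairs are assigned a constant anchored offset so that attention no longer decays with inter-modal distance. This is a minimal illustration only, assuming a per-token modality id and a hypothetical `anchor_dist` parameter; the paper's actual integration with Multimodal RoPE may differ.

```python
import numpy as np

def dipe_relative_positions(modalities, anchor_dist=0):
    """Sketch of inter-modal Distance Invariant Position Encoding (DIPE).

    For each (query, key) token pair:
      - intra-modal (same modality): keep the natural relative offset q - k,
        preserving local structure;
      - inter-modal (different modalities): replace the offset with a fixed
        anchored value, so the distance-based attention penalty disappears.

    `modalities`: per-token modality id (e.g. 0 = vision, 1 = text).
    `anchor_dist`: constant inter-modal offset (an assumed knob for this
    sketch, not a value taken from the paper).
    """
    pos = np.arange(len(modalities))
    rel = pos[:, None] - pos[None, :]               # standard relative offsets
    same = np.equal.outer(modalities, modalities)   # True where modalities match
    return np.where(same, rel, anchor_dist)

# Example: 3 visual tokens followed by 4 text tokens.
mods = [0, 0, 0, 1, 1, 1, 1]
rel = dipe_relative_positions(mods, anchor_dist=0)
```

Here `rel[i][j]` equals `i - j` when tokens `i` and `j` share a modality, and stays at the anchored value otherwise, no matter how far the text token drifts from the visual tokens.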

原文摘要

Despite the remarkable capabilities of Multimodal Large Language Models (MLLMs), they still suffer from visual fading in long-context scenarios. Specifically, the attention to visual tokens diminishes as the text sequence lengthens, leading to text generation detached from visual constraints. We attribute this degradation to the inherent inductive bias of Multimodal RoPE, which penalizes inter-modal attention as the distance between visual and text tokens increases. To address this, we propose inter-modal Distance Invariant Position Encoding (DIPE), a simple but effective mechanism that disentangles position encoding based on modality interactions. DIPE retains the natural relative positioning for intra-modal interactions to preserve local structure, while enforcing an anchored perceptual proximity for inter-modal interactions. This strategy effectively mitigates the inter-modal distance-based penalty, ensuring that visual signals remain perceptually consistent regardless of the context length. Experimental results demonstrate that by integrating DIPE with Multimodal RoPE, the model maintains stable visual grounding in long-context scenarios, significantly alleviating visual fading while preserving performance on standard short-context benchmarks. Code is available at https://github.com/lchen1019/DIPE.

Tags

Multimodal Learning · Position Encoding · Long Context · Visual Fading

arXiv Categories

cs.CV