Multimodal Learning Relevance: 9/10

ReMoRa: Multimodal Large Language Model based on Refined Motion Representation for Long-Video Understanding

Daichi Yashima, Shuhei Kurita, Yusuke Oda, Komei Sugiura
arXiv: 2602.16412v1 Published: 2026-02-18 Updated: 2026-02-18

AI Summary

ReMoRa improves the long-video understanding performance of multimodal large language models through refined motion representations.

Key Contributions

  • Proposes ReMoRa, a video MLLM that operates directly on compressed video representations
  • Encodes temporal dynamics as motion representations, reducing computational redundancy
  • Introduces a module that denoises block-based motions and generates fine-grained motion representations

Methodology

ReMoRa retains sparse RGB keyframes for appearance information, captures temporal dynamics with motion representations, and uses a denoising module to generate fine-grained motion information.
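The split described above — a few RGB keyframes for appearance, compressed-domain motion vectors standing in for the dropped frames — can be sketched as follows. This is a minimal illustration with synthetic data, not the paper's implementation; the function name, uniform keyframe schedule, and block-vector shapes are all assumptions.

```python
import numpy as np

def sample_keyframes_and_motion(num_frames, motion_vectors, num_keyframes=4):
    """Hypothetical sketch: pick sparse RGB keyframe indices and keep the
    block motion vectors at the remaining positions as a compact proxy
    for the dropped RGB frames."""
    # Uniformly spaced keyframe indices over the clip (an assumption;
    # the paper does not state its sampling schedule).
    key_idx = np.linspace(0, num_frames - 1, num_keyframes).round().astype(int)
    # All non-keyframe positions are represented only by motion.
    motion_idx = np.setdiff1d(np.arange(num_frames), key_idx)
    return key_idx, motion_vectors[motion_idx]

# Synthetic compressed-domain motion field: (T, H/16, W/16, 2) block vectors.
T = 32
mv = np.random.randn(T, 14, 14, 2).astype(np.float32)
keys, motions = sample_keyframes_and_motion(T, mv, num_keyframes=4)
print(keys)           # [ 0 10 21 31]
print(motions.shape)  # (28, 14, 14, 2)
```

In a real pipeline the motion vectors would come from the codec bitstream rather than being synthesized, which is what lets the model skip full RGB decoding.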

Original Abstract

While multimodal large language models (MLLMs) have shown remarkable success across a wide range of tasks, long-form video understanding remains a significant challenge. In this study, we focus on video understanding by MLLMs. This task is challenging because processing a full stream of RGB frames is computationally intractable and highly redundant, as self-attention has quadratic complexity in sequence length. In this paper, we propose ReMoRa, a video MLLM that processes videos by operating directly on their compressed representations. A sparse set of RGB keyframes is retained for appearance, while temporal dynamics are encoded as a motion representation, removing the need for sequential RGB frames. These motion representations act as a compact proxy for optical flow, capturing temporal dynamics without full frame decoding. To address the noise and low fidelity of block-based motions, we introduce a module that denoises them and generates a fine-grained motion representation. Furthermore, our model compresses these features in a way that scales linearly with sequence length. We demonstrate the effectiveness of ReMoRa through extensive experiments across a comprehensive suite of long-video understanding benchmarks. ReMoRa outperformed baseline methods on multiple challenging benchmarks, including LongVideoBench, NExT-QA, and MLVU.
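The abstract's claim of compression that "scales linearly with sequence length" can be illustrated with a generic operator: the paper does not specify its compression mechanism, so the fixed-window mean-pooling below is only one assumed instance of a linear-cost reduction (function name, window size, and zero-padding are all choices made here for the sketch).

```python
import numpy as np

def pool_tokens(features, window=4):
    """Hypothetical sketch: compress a (N, D) token sequence by mean-pooling
    non-overlapping windows. Both cost and output length grow linearly
    with N, unlike quadratic self-attention over the full sequence."""
    n, d = features.shape
    pad = (-n) % window
    if pad:
        # Zero-pad so N divides evenly; this slightly biases the last
        # window's mean, which is acceptable for an illustration.
        features = np.concatenate([features, np.zeros((pad, d), features.dtype)])
    return features.reshape(-1, window, d).mean(axis=1)

x = np.random.randn(30, 8).astype(np.float32)
y = pool_tokens(x, window=4)
print(y.shape)  # (8, 8): 30 tokens padded to 32, pooled into 8
```

Any operator with this O(N) shape (strided convolution, windowed attention, learned query pooling) would serve the same role of keeping the token count fed to the LLM manageable for long videos.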

Tags

MLLM Multimodal Video-Understanding Motion-Representation Long-Video

arXiv Categories

cs.CV