MSJoE: Jointly Evolving MLLM and Sampler for Efficient Long-Form Video Understanding
AI Summary
MSJoE jointly optimizes an MLLM and a frame sampler to improve both the efficiency and accuracy of long-form video understanding.
Key Contributions
- Proposes the MSJoE framework, which jointly evolves an MLLM and a lightweight key-frame sampler
- Introduces query reasoning to improve the accuracy of key-frame selection
- Builds a long-video QA dataset to support model training
Methodology
A frozen CLIP model produces a query-frame similarity matrix, from which a lightweight sampler predicts key-frame sampling weights; the MLLM and sampler are then jointly optimized via reinforcement learning.
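The similarity-then-sample pipeline can be sketched as follows. This is a minimal illustration, not the paper's implementation: random vectors stand in for CLIP query/frame embeddings, and a fixed max-over-queries aggregation replaces the learned, RL-trained sampler; the function names and the choice of top-k selection are assumptions for illustration.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def similarity_matrix(query_emb, frame_emb):
    """Cosine similarity between Q query and T frame embeddings -> (Q, T)."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    f = frame_emb / np.linalg.norm(frame_emb, axis=1, keepdims=True)
    return q @ f.T

def sample_key_frames(sim, k=8):
    """Stand-in for the learned sampler: score each frame by its best
    query match, turn scores into sampling weights, keep the top-k frames."""
    scores = sim.max(axis=0)                # (T,) best query match per frame
    weights = softmax(scores)
    top_k = np.sort(np.argsort(weights)[::-1][:k])  # keep temporal order
    return top_k, weights

# Toy run with random stand-in embeddings (3 queries, 64 frames, dim 512).
rng = np.random.default_rng(0)
sim = similarity_matrix(rng.normal(size=(3, 512)), rng.normal(size=(64, 512)))
frames, weights = sample_key_frames(sim, k=8)
```

The selected `frames` would then index into the video to retrieve the compact frame set passed to the MLLM; in MSJoE the sampler's weights are instead predicted by a trained network and refined jointly with the MLLM via reinforcement learning.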
Original Abstract
Efficiently understanding long-form videos remains a fundamental challenge for multimodal large language models (MLLMs). In this paper, we present MLLM-Sampler Joint Evolution (MSJoE), a novel framework that jointly evolves the MLLM and a lightweight key-frame sampler for efficient long-form video understanding. MSJoE builds upon a key assumption that only a small subset of key-frames is truly informative for answering each question about a video. Specifically, MSJoE first reasons out several queries, which describe diverse visual perspectives relevant to the question. Then, these queries interact with a frozen CLIP model to produce a query-frame similarity matrix. Finally, a lightweight sampler predicts key-frame sampling weights from this matrix, selecting a compact set of informative frames, which are then fed into the MLLM for answer generation. Both the MLLM and sampler are jointly optimized through reinforcement learning, enabling co-adaptation of query reasoning, frame sampling, and key-frame understanding. A new long-video QA dataset containing 2.8K videos with 7K question-answer pairs is collected to support the training process. Extensive experiments on VideoMME, LongVideoBench, LVBench, and MLVU show that MSJoE achieves an 8.0% accuracy gain over the base MLLM and 1.1% higher accuracy than the strongest baseline method.