Query-Conditioned Evidential Keyframe Sampling for MLLM-Based Long-Form Video Understanding
AI Summary
Proposes a query-conditioned evidential keyframe sampling method grounded in information bottleneck theory, improving MLLM performance on long-form video understanding tasks.
Main Contributions
- Proposes a keyframe sampling framework grounded in information bottleneck theory
- Designs a query-conditioned evidence scoring network
- Outperforms existing methods on long-form video understanding benchmarks
Methodology
Keyframe selection is formulated as maximizing the conditional mutual information between the selected frames and the query, and a query-conditioned evidence scoring network is trained to estimate each frame's evidential importance.
Original Abstract
Multimodal Large Language Models (MLLMs) have shown strong performance on video question answering, but their application to long-form videos is constrained by limited context length and computational cost, making keyframe sampling essential. Existing approaches typically rely on semantic relevance or reinforcement learning, which either fail to capture evidential clues or suffer from inefficient combinatorial optimization. In this work, we propose an evidence-driven keyframe sampling framework grounded in information bottleneck theory. We formulate keyframe selection as maximizing the conditional mutual information between selected frames and the query, providing a principled objective that reflects each frame's contribution to answering the question. To make this objective tractable, we exploit its structure to derive a decomposed optimization that reduces subset selection to independent frame-level scoring. We further introduce a query-conditioned evidence scoring network trained with a contrastive objective to estimate evidential importance efficiently. Experiments on long-form video understanding benchmarks show that our method consistently outperforms prior sampling strategies under strict token budgets, while significantly improving training efficiency.
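The decomposed optimization described above reduces subset selection to independent frame-level scoring, so selecting k frames under a token budget becomes a simple top-k over scores. The sketch below illustrates this pipeline under assumptions: the `EvidenceScorer` architecture, embedding dimensions, and the InfoNCE-style contrastive loss are illustrative placeholders, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EvidenceScorer(nn.Module):
    """Hypothetical query-conditioned scorer: maps (frame, query) pairs to
    scalar evidential-importance scores."""
    def __init__(self, dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1)
        )

    def forward(self, frames: torch.Tensor, query: torch.Tensor) -> torch.Tensor:
        # frames: (T, D) frame embeddings; query: (D,) query embedding
        q = query.expand(frames.size(0), -1)  # broadcast the query to each frame
        return self.mlp(torch.cat([frames, q], dim=-1)).squeeze(-1)  # (T,)

def select_keyframes(scores: torch.Tensor, budget: int) -> torch.Tensor:
    # Because the objective decomposes into per-frame scores, selecting a
    # k-frame subset under the token budget is just a top-k.
    return torch.topk(scores, k=min(budget, scores.numel())).indices

def contrastive_loss(scores: torch.Tensor, positive_idx: int, tau: float = 0.1):
    # InfoNCE-style objective (an assumption about the training setup):
    # the evidence-bearing frame is the positive, all others are negatives.
    return F.cross_entropy(scores.unsqueeze(0) / tau,
                           torch.tensor([positive_idx]))

torch.manual_seed(0)
scorer = EvidenceScorer(dim=16)
frames, query = torch.randn(8, 16), torch.randn(16)

scores = scorer(frames, query)                 # per-frame evidential scores
keep = select_keyframes(scores, budget=3)      # indices of 3 frames to pass to the MLLM
loss = contrastive_loss(scores, positive_idx=2)
```

Because scoring is independent per frame, training avoids the combinatorial subset search that reinforcement-learning samplers face, which is what the abstract credits for the efficiency gain.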