LongVideo-R1: Smart Navigation for Low-cost Long Video Understanding
AI Summary
LongVideo-R1 proposes an efficient, reasoning-based multimodal agent for low-cost long video understanding.
Key Contributions
- Proposes the LongVideo-R1 agent for efficient long video understanding.
- Introduces a reasoning module that uses high-level visual cues to navigate the video context.
- Adopts two-stage fine-tuning (SFT + RL) to optimize the agent's navigation ability.
Methodology
Hierarchical video captions guide GPT-5 in generating CoT trajectories; the Qwen-3-8B model is then fine-tuned via SFT and RL, with a reward function designed to optimize navigation efficiency.
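The navigation procedure above (start from coarse summaries, zoom into the most informative clip, stop once the query is answerable) can be sketched as a simple loop. This is an illustrative reconstruction, not the paper's implementation: the names `VideoNode`, `navigate`, and the `reason` callback (standing in for the MLLM reasoning module) are all hypothetical.

```python
# Hypothetical sketch of the hierarchical clip-navigation loop.
# `VideoNode` and `navigate` are illustrative names, not from the paper.
from dataclasses import dataclass, field

@dataclass
class VideoNode:
    caption: str                                   # summary of this clip/segment
    children: list = field(default_factory=list)   # finer-grained sub-clips

def navigate(root: VideoNode, query: str, reason, max_steps: int = 8):
    """Traverse from coarse summaries toward fine clips, halting early.

    `reason(query, captions)` stands in for the MLLM reasoning module:
    it returns ("answer", text) once it has sufficient evidence, or
    ("descend", index) to refine focus into the chosen child clip.
    """
    node, steps = root, 0
    while steps < max_steps:
        captions = [c.caption for c in node.children] or [node.caption]
        action, payload = reason(query, captions)
        if action == "answer" or not node.children:
            return payload, steps       # early halt: no exhaustive search
        node = node.children[payload]   # zoom into the selected clip
        steps += 1
    return None, steps
```

The key design point mirrored here is that the agent inspects only one branch of the caption hierarchy per step, so cost scales with tree depth rather than video length.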
Original Abstract
This paper addresses the critical and underexplored challenge of long video understanding with low computational budgets. We propose LongVideo-R1, an active, reasoning-equipped multimodal large language model (MLLM) agent designed for efficient video context navigation, avoiding the redundancy of exhaustive search. At the core of LongVideo-R1 lies a reasoning module that leverages high-level visual cues to infer the most informative video clip for subsequent processing. During inference, the agent initiates traversal from top-level visual summaries and iteratively refines its focus, immediately halting the exploration process upon acquiring sufficient knowledge to answer the query. To facilitate training, we first extract hierarchical video captions from CGBench, a video corpus with grounding annotations, and guide GPT-5 to generate 33K high-quality chain-of-thought-with-tool trajectories. The LongVideo-R1 agent is fine-tuned from the Qwen-3-8B model through a two-stage paradigm: supervised fine-tuning (SFT) followed by reinforcement learning (RL), where RL employs a specifically designed reward function to maximize selective and efficient clip navigation. Experiments on multiple long video benchmarks validate the effectiveness of LongVideo-R1, which achieves a superior tradeoff between QA accuracy and efficiency. All curated data and source code are provided in the supplementary material and will be made publicly available. Code and data are available at: https://github.com/qiujihao19/LongVideo-R1
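The abstract states that the RL stage uses a reward designed to make clip navigation both selective and efficient, but does not give the formula. One plausible minimal shape, offered purely as an assumption, is answer correctness discounted by the number of clips explored; the function `navigation_reward` and its parameters are hypothetical.

```python
# Hypothetical reward shaping for the RL stage: reward correct answers,
# penalize exploration cost. The actual formula in the paper may differ.
def navigation_reward(correct: bool, steps: int,
                      step_penalty: float = 0.1, max_steps: int = 8) -> float:
    """Return 1.0 for a correct answer, minus a per-step navigation cost."""
    base = 1.0 if correct else 0.0
    cost = step_penalty * min(steps, max_steps)
    return base - cost
```

A reward of this shape pushes the policy toward answering as soon as the evidence suffices, which matches the early-halting behavior the abstract describes.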