AI Agents Relevance: 9/10

EVA: Efficient Reinforcement Learning for End-to-End Video Agent

Yaolun Zhang, Ruohui Wang, Jiahao Wang, Yepeng Tang, Xuanyu Zheng, Haonan Duan, Hao Lu, Hanming Deng, Lewei Lu
arXiv: 2603.22918v1 Published: 2026-03-24 Updated: 2026-03-24

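AI Summary

EVA is an efficient reinforcement learning framework for end-to-end video agents. The agent plans before it perceives, which makes video understanding query-driven and efficient.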

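Main Contributions

  • Proposes the EVA framework for efficient video understanding
  • Designs a three-stage learning pipeline: SFT, KTO, and GRPO
  • Constructs high-quality datasets for model training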

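Methodology

EVA is trained with reinforcement learning. Through iterative summary-plan-action-reflection reasoning, the agent autonomously decides what to watch, when to watch, and how to watch, achieving query-driven video understanding.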
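The summary-plan-action-reflection loop can be illustrated with a minimal Python sketch. It is not the paper's implementation: the names `Plan`, `State`, `summarize`, `plan`, `reflect`, and `run_agent` are hypothetical, text captions stand in for video frames, and a hand-written word-overlap heuristic stands in for the policy that EVA learns. It only shows the control flow of choosing a time window ("when") and a sampling stride ("how") before any frame is read.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Plan:
    start: int   # first frame index to watch ("when")
    end: int     # one past the last frame index to watch
    stride: int  # sampling step inside the window ("how")


@dataclass
class State:
    query: str
    evidence: Dict[int, str] = field(default_factory=dict)  # frame index -> caption
    frames_seen: int = 0                                    # total frame reads
    trace: List[str] = field(default_factory=list)          # per-step summaries


def overlap(query: str, caption: str) -> int:
    """Toy relevance score: number of words shared by query and caption."""
    return len(set(query.lower().split()) & set(caption.lower().split()))


def summarize(state: State) -> str:
    """Summary: compress everything observed so far into one string."""
    return "; ".join(f"{i}:{c}" for i, c in sorted(state.evidence.items()))


def plan(state: State, n_frames: int) -> Plan:
    """Plan: coarse scan first, then a dense look after the most relevant frame."""
    coarse = max(1, n_frames // 2)
    if not state.evidence:
        return Plan(0, n_frames, coarse)
    best_idx, _ = max(state.evidence.items(),
                      key=lambda item: overlap(state.query, item[1]))
    return Plan(best_idx, min(n_frames, best_idx + coarse), 1)


def reflect(state: State) -> Optional[str]:
    """Reflection: answer only if some caption matches the query well enough."""
    for _, caption in sorted(state.evidence.items()):
        if overlap(state.query, caption) >= 2:
            return caption
    return None


def run_agent(video: List[str], query: str,
              max_steps: int = 3) -> Tuple[Optional[str], State]:
    state = State(query=query)
    for _ in range(max_steps):
        state.trace.append(summarize(state))             # summary
        p = plan(state, len(video))                      # plan (before perception)
        for i in range(p.start, p.end, p.stride):        # action: read chosen frames
            state.evidence[i] = video[i]
            state.frames_seen += 1
        answer = reflect(state)                          # reflection
        if answer is not None:
            return answer, state
    return None, state
```

Calling `run_agent(captions, "dog catches what")` on a short list of captions returns the best-matching caption together with the final state, whose `frames_seen` field shows how many frame reads the toy policy spent. In EVA the planning and reflection steps are produced by the trained model rather than by fixed rules like these.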

Original Abstract

Video understanding with multimodal large language models (MLLMs) remains challenging due to the long token sequences of videos, which contain extensive temporal dependencies and redundant frames. Existing approaches typically treat MLLMs as passive recognizers, processing entire videos or uniformly sampled frames without adaptive reasoning. Recent agent-based methods introduce external tools, yet still depend on manually designed workflows and perception-first strategies, resulting in inefficiency on long videos. We present EVA, an Efficient Reinforcement Learning framework for End-to-End Video Agent, which enables planning-before-perception through iterative summary-plan-action-reflection reasoning. EVA autonomously decides what to watch, when to watch, and how to watch, achieving query-driven and efficient video understanding. To train such agents, we design a simple yet effective three-stage learning pipeline - comprising supervised fine-tuning (SFT), Kahneman-Tversky Optimization (KTO), and Generalized Reward Policy Optimization (GRPO) - that bridges supervised imitation and reinforcement learning. We further construct high-quality datasets for each stage, supporting stable and reproducible training. We evaluate EVA on six video understanding benchmarks, demonstrating its comprehensive capabilities. Compared with existing baselines, EVA achieves a substantial improvement of 6-12% over general MLLM baselines and a further 1-3% gain over prior adaptive agent methods. Our code and model are available at https://github.com/wangruohui/EfficientVideoAgent.

Tags

Video Understanding, Reinforcement Learning, Multimodal, Agent

arXiv Categories

cs.CV cs.AI cs.CL