Multimodal Learning (relevance: 10/10)

HORNet: Task-Guided Frame Selection for Video Question Answering with Vision-Language Models

Xiangyu Bai, Bishoy Galoaa, Sarah Ostadabbas
arXiv: 2603.18850v1 · Published: 2026-03-19 · Updated: 2026-03-19

AI Summary

HORNet improves the efficiency and answer quality of VLMs on video question answering by learning a frame selection policy.

Key Contributions

  • Proposes HORNet, a lightweight frame selection policy
  • Trains the selection policy with Group Relative Policy Optimization (GRPO)
  • Achieves significant performance gains on multiple video question answering benchmarks

Methodology

A lightweight frame selection policy is trained with GRPO to decide which key frames a frozen VLM receives when answering questions about a video.
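The core of GRPO is that each sampled action (here, a candidate frame subset) is scored not against a learned value baseline but against the other samples in its own group. A minimal sketch of this group-relative advantage computation, with a hypothetical 0/1 answer-correctness reward (the actual HORNet reward design is not specified here):

```python
import numpy as np

def group_relative_advantages(rewards):
    """GRPO-style advantage: normalize each sample's reward against
    the mean and std of its group (samples drawn for the same question)."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)  # epsilon avoids division by zero

# Hypothetical example: four frame subsets sampled for one question,
# rewarded 1.0 if the frozen VLM answered correctly, else 0.0.
adv = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
# Subsets that led to correct answers get positive advantage,
# the others negative; the group advantages sum to zero.
```

These advantages then weight a clipped policy-gradient update on the selection policy alone; the VLM itself stays frozen, which is what keeps the trainable parameter count small.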

Original Abstract

Video question answering (VQA) with vision-language models (VLMs) depends critically on which frames are selected from the input video, yet most systems rely on uniform or heuristic sampling that cannot be optimized for downstream answering quality. We introduce HORNet, a lightweight frame selection policy trained with Group Relative Policy Optimization (GRPO) to learn which frames a frozen VLM needs to answer questions correctly. With fewer than 1M trainable parameters, HORNet reduces input frames by up to 99% and VLM processing time by up to 93%, while improving answer quality on short-form benchmarks (+1.7% F1 on MSVD-QA) and achieving strong performance on temporal reasoning tasks (+7.3 points over uniform sampling on NExT-QA). We formalize this as Select Any Frames (SAF), a task that decouples visual input curation from VLM reasoning, and show that GRPO-trained selection generalizes better out-of-distribution than supervised and PPO alternatives. HORNet's policy further transfers across VLM answerers without retraining, yielding an additional 8.5% relative gain when paired with a stronger model. Evaluated across six benchmarks spanning 341,877 QA pairs and 114.2 hours of video, our results demonstrate that optimizing *what* a VLM sees is a practical and complementary alternative to optimizing what it generates while improving efficiency. Code is available at https://github.com/ostadabbas/HORNet.

Tags

Video Question Answering · Vision-Language Models · Frame Selection · Reinforcement Learning

arXiv Categories

cs.CV