Attend Before Attention: Efficient and Scalable Video Understanding via Autoregressive Gazing
AI Summary
AutoGaze autoregressively selects key video patches, reducing computation and improving the ability of multi-modal large language models to process long videos.
Key Contributions
- Proposes the AutoGaze module, which significantly reduces redundant computation in video processing.
- Achieves strong results on multiple video benchmarks.
- Constructs HLVid, a high-resolution, long-form video question-answering dataset.
Methodology
Using an autoregressive approach that combines next-token prediction and reinforcement learning, AutoGaze selects a minimal set of multi-scale patches sufficient to reconstruct the video, thereby reducing the number of visual tokens.
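As a rough illustration of the idea, the sketch below greedily copies the worst-reconstructed patches into a coarse reconstruction until a user-specified error threshold is met. This is a hypothetical single-scale simplification: the actual AutoGaze uses a learned autoregressive model with multi-scale patches and reinforcement learning, not this greedy heuristic, and the function name and parameters here are invented for illustration.

```python
import numpy as np

def select_patches(video, patch=4, err_thresh=0.01):
    """Greedy stand-in for AutoGaze-style patch selection (illustrative only).

    video: (T, H, W) float array in [0, 1].
    Starts from a coarse baseline (frame 0 repeated, exploiting temporal
    redundancy) and keeps adding the patch with the largest reconstruction
    error until the mean squared error drops below err_thresh.
    """
    T, H, W = video.shape
    recon = np.repeat(video[:1], T, axis=0)  # coarse baseline reconstruction
    coords = [(t, y, x) for t in range(T)
              for y in range(0, H, patch)
              for x in range(0, W, patch)]

    def patch_err(c):
        t, y, x = c
        diff = video[t, y:y+patch, x:x+patch] - recon[t, y:y+patch, x:x+patch]
        return float(np.sum(diff ** 2))

    selected = []
    while coords and np.mean((video - recon) ** 2) > err_thresh:
        best = max(coords, key=patch_err)  # worst-reconstructed patch
        if patch_err(best) == 0.0:
            break  # nothing left to improve
        t, y, x = best
        recon[t, y:y+patch, x:x+patch] = video[t, y:y+patch, x:x+patch]
        selected.append(best)
        coords.remove(best)
    return selected, recon
```

On a toy clip where most content is static, only the patches that actually change are selected, which is the source of the token reduction: the rest of the video is covered by the coarse reconstruction.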
Original Abstract
Multi-modal large language models (MLLMs) have advanced general-purpose video understanding but struggle with long, high-resolution videos -- they process every pixel equally in their vision transformers (ViTs) or LLMs despite significant spatiotemporal redundancy. We introduce AutoGaze, a lightweight module that removes redundant patches before they are processed by a ViT or an MLLM. Trained with next-token prediction and reinforcement learning, AutoGaze autoregressively selects a minimal set of multi-scale patches that can reconstruct the video within a user-specified error threshold, eliminating redundancy while preserving information. Empirically, AutoGaze reduces visual tokens by 4x-100x and accelerates ViTs and MLLMs by up to 19x, enabling MLLMs to scale to 1K-frame 4K-resolution videos and achieving superior results on video benchmarks (e.g., 67.0% on VideoMME). Furthermore, we introduce HLVid: the first high-resolution, long-form video QA benchmark with 5-minute 4K-resolution videos, where an MLLM scaled with AutoGaze improves over the baseline by 10.1% and outperforms the previous best MLLM by 4.5%. Project page: https://autogaze.github.io/.