Multimodal Learning relevance: 9/10

A Simple Baseline for Streaming Video Understanding

Yujiao Shen, Shulin Tian, Jingkang Yang, Ziwei Liu
arXiv: 2604.02317v1 Published: 2026-04-02 Updated: 2026-04-02

AI Summary

Proposes SimpleStream, a baseline that uses only a sliding window yet matches complex streaming video understanding models, and reveals a perception-memory trade-off.

Main Contributions

  • Proposes SimpleStream, a simple sliding-window baseline
  • Validates the effectiveness of SimpleStream on streaming video understanding tasks
  • Reveals a trade-off between perception and memory
  • Recommends that future benchmarks separate recent-scene perception from long-range memory

Methodology

A sliding window feeds only the most recent N frames into a pretrained VLM for streaming video understanding tasks, and the resulting baseline is compared against other models.
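The sliding-window mechanism described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `vlm_answer` is a hypothetical callable standing in for an off-the-shelf VLM, and the frame and query types are placeholders.

```python
from collections import deque

def answer_stream(frames, query, vlm_answer, window=4):
    """Sliding-window streaming baseline (sketch).

    Keeps only the most recent `window` frames in a bounded buffer and
    queries the VLM with that buffer at every step. `vlm_answer` is an
    assumed stand-in for an off-the-shelf VLM call; the paper does not
    prescribe a specific backbone here.
    """
    buf = deque(maxlen=window)  # oldest frames are evicted automatically
    answers = []
    for frame in frames:
        buf.append(frame)                       # ingest the newest frame
        answers.append(vlm_answer(list(buf), query))  # query on recent context only
    return answers
```

With `window=4` this mirrors the paper's strongest reported setting: the model never sees more than the four most recent frames, so no explicit memory, retrieval, or compression module is involved.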

Original Abstract

Recent streaming video understanding methods increasingly rely on complex memory mechanisms to handle long video streams. We challenge this trend with a simple finding: a sliding-window baseline that feeds only the most recent N frames to an off-the-shelf VLM already matches or surpasses published streaming models. We formalize this baseline as SimpleStream and evaluate it against 13 major offline and online video LLM baselines on OVO-Bench and StreamingBench. Despite its simplicity, SimpleStream delivers consistently strong performance. With only 4 recent frames, it reaches 67.7% average accuracy on OVO-Bench and 80.59% on StreamingBench. Controlled ablations further show that the value of longer context is backbone-dependent rather than uniformly increasing with model scale, and reveal a consistent perception-memory trade-off: adding more historical context can improve recall, but often weakens real-time perception. This suggests that stronger memory, retrieval, or compression modules should not be taken as evidence of progress unless they clearly outperform SimpleStream under the same protocol. We therefore argue that future streaming benchmarks should separate recent-scene perception from long-range memory, so that performance improvements from added complexity can be evaluated more clearly.

Tags

streaming video understanding, VLM, sliding window baseline, perception-memory trade-off

arXiv Categories

cs.CV