WeaveTime: Stream from Earlier Frames into Emergent Memory in VideoLLMs
AI Summary
WeaveTime addresses the lack of temporal awareness in Video-LLMs under streaming settings, improving both accuracy and efficiency.
Key Contributions
- Identifies Time-Agnosticism, a temporal-awareness problem in Video-LLMs
- Designs Temporal Reconstruction, a streaming order-perception enhancement
- Introduces the Past-Current Dynamic Focus Cache
Methodology
A Temporal Reconstruction objective teaches the model order-aware representations, and a Dynamic Focus Cache retrieves historical information at inference, improving streaming video understanding.
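The paper does not publish implementation details here, but the retrieval idea can be sketched as follows: keep a running cache of per-frame features, answer from a coarse window of recent frames by default, and expand into the full past only when the model's answer distribution looks uncertain. All names (`FocusCache`, `recent_k`, `tau`) and the entropy-based uncertainty trigger are illustrative assumptions, not the authors' actual design.

```python
# Hypothetical sketch of an uncertainty-triggered, coarse-to-fine history cache.
# Class and parameter names are illustrative; the paper's mechanism may differ.
import numpy as np

class FocusCache:
    def __init__(self, recent_k=4, tau=0.5):
        self.frames = []          # full history of frame features
        self.recent_k = recent_k  # coarse window: most recent frames
        self.tau = tau            # normalized-entropy threshold for expansion

    def append(self, feat):
        self.frames.append(np.asarray(feat, dtype=np.float64))

    @staticmethod
    def _entropy(p):
        # Normalized entropy in [0, 1] as a cheap uncertainty proxy.
        p = np.clip(np.asarray(p, dtype=np.float64), 1e-12, 1.0)
        return float(-(p * np.log(p)).sum() / np.log(len(p)))

    def retrieve(self, query, answer_probs, top_m=2):
        # Coarse pass: only the most recent frames (the "current" focus).
        coarse = self.frames[-self.recent_k:]
        if self._entropy(answer_probs) < self.tau or len(self.frames) <= self.recent_k:
            return coarse
        # Fine pass: the model is uncertain, so search the full past by
        # cosine similarity and prepend the top-m matching frames.
        past = self.frames[:-self.recent_k]
        q = np.asarray(query, dtype=np.float64)
        sims = [f @ q / (np.linalg.norm(f) * np.linalg.norm(q) + 1e-12) for f in past]
        order = np.argsort(sims)[::-1][:top_m]
        return [past[i] for i in sorted(order)] + coarse
```

The design point being illustrated is the asymmetry: cheap recent-window attention on every step, with full-history retrieval paid for only when confidence drops, which is how history can grow without growing per-step latency.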
Original Abstract
Recent advances in Multimodal Large Language Models have greatly improved visual understanding and reasoning, yet their quadratic attention and offline training protocols make them ill-suited for streaming settings where frames arrive sequentially and future observations are inaccessible. We diagnose a core limitation of current Video-LLMs, namely Time-Agnosticism, in which videos are treated as an unordered bag of evidence rather than a causally ordered sequence, yielding two failures in streams: temporal order ambiguity, in which the model cannot follow or reason over the correct chronological order, and past-current focus blindness, in which it fails to distinguish present observations from accumulated history. We present WeaveTime, a simple, efficient, and model-agnostic framework that first teaches order and then uses order. We introduce a lightweight Temporal Reconstruction objective (our Streaming Order Perception enhancement) that instills order-aware representations with minimal finetuning and no specialized streaming data. At inference, a Past-Current Dynamic Focus Cache performs uncertainty-triggered, coarse-to-fine retrieval, expanding history only when needed. Plugged into existing Video-LLMs without architectural changes, WeaveTime delivers consistent gains on representative streaming benchmarks, improving accuracy while reducing latency. These results establish WeaveTime as a practical path toward time-aware streaming Video-LLMs under strict online, time-causal constraints. Code and weights will be made publicly available. Project Page: https://zhangyl4.github.io/publications/weavetime/