Multimodal Learning Relevance: 9/10

Think-as-You-See: Streaming Chain-of-Thought Reasoning for Large Vision-Language Models

Jialiang Zhang, Junlong Tong, Junyan Lin, Hao Wu, Yirong Sun, Yunpu Ma, Xiaoyu Shen
arXiv: 2603.02872v1 Published: 2026-03-03 Updated: 2026-03-03

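AI Summary

Proposes Think-as-You-See (TaYS), a parallelized CoT reasoning framework for video streams that improves the efficiency and responsiveness of LVLMs on video understanding tasks.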

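Key Contributions

  • Proposes the TaYS framework, which enables parallelized CoT reasoning over video streams
  • Introduces temporally aligned reasoning units, streaming attention masks and positional encodings, and a dual KV-cache
  • Validates TaYS on video CoT tasks, where it substantially reduces time-to-first-token (TTFT) and overall reasoning delay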
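
To make the idea of a streaming attention mask over temporally aligned reasoning units concrete, here is a minimal sketch. It is not the authors' implementation: the `Token` type, the `streaming_mask` function, and the specific visibility rules (frames never attend to reasoning tokens, reasoning tokens see only frames that have arrived by their own time step) are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Token:
    kind: str  # "frame" (visual token) or "think" (reasoning token); hypothetical labels
    step: int  # index of the time step the token is aligned to


def streaming_mask(tokens: List[Token]) -> List[List[bool]]:
    """Return mask[q][k], True when query token q may attend to key token k.

    Assumed rules (illustrative only):
      * frame tokens attend to frame tokens of the same or earlier steps;
      * reasoning tokens attend to frames that have arrived by their step;
      * reasoning tokens attend causally to reasoning tokens at or before them.
    """
    mask = []
    for qi, q in enumerate(tokens):
        row = []
        for ki, k in enumerate(tokens):
            if q.kind == "frame":
                allowed = k.kind == "frame" and k.step <= q.step
            elif k.kind == "frame":
                allowed = k.step <= q.step
            else:
                allowed = ki <= qi
            row.append(allowed)
        mask.append(row)
    return mask


# Two tokens for frame 0, a reasoning token for step 0,
# then one token for frame 1 and a reasoning token for step 1.
tokens = [
    Token("frame", 0),
    Token("frame", 0),
    Token("think", 0),
    Token("frame", 1),
    Token("think", 1),
]
mask = streaming_mask(tokens)
```

Under these assumed rules, the step-0 reasoning token cannot see frame 1 (`mask[2][3]` is `False`), while the step-1 reasoning token can see both frames and the earlier reasoning token.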

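Methodology

TaYS combines parallelized CoT generation, stream-constrained training, and stream-parallel inference with temporally aligned reasoning units and a dual KV-cache to achieve efficient reasoning over video streams.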
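
The dual KV-cache mentioned above can be pictured as two independently growing stores, one for visual tokens and one for reasoning tokens. The sketch below is a toy model under that assumption, not the paper's code: `DualKVCache`, its method names, and the string placeholders standing in for key/value tensors are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

KV = Tuple[str, str]  # placeholder (key, value) pair; a real cache would hold tensors


@dataclass
class DualKVCache:
    """Toy cache keeping visual and textual entries in separate lists."""

    visual: List[KV] = field(default_factory=list)
    text: List[KV] = field(default_factory=list)

    def add_frame(self, frame_id: int) -> None:
        # Encoding a new frame appends to the visual cache only.
        self.visual.append((f"k_frame{frame_id}", f"v_frame{frame_id}"))

    def add_thought(self, token: str) -> None:
        # Generating a reasoning token appends to the text cache only.
        self.text.append((f"k_{token}", f"v_{token}"))

    def context(self) -> List[KV]:
        # Snapshot seen by the next reasoning token: all visual entries
        # cached so far, followed by all reasoning entries cached so far.
        return self.visual + self.text


cache = DualKVCache()
cache.add_frame(0)
cache.add_thought("a")
cache.add_frame(1)  # a later frame arrives without touching the text cache
context = cache.context()
```

Because the two lists grow independently, a frame that arrives mid-reasoning does not have to be spliced into a single strictly ordered cache, which is the decoupling the summary attributes to the dual KV-cache.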

Original Abstract

Large Vision Language Models (LVLMs) exhibit strong Chain-of-Thought (CoT) capabilities, yet most existing paradigms assume full-video availability before inference, a batch-style process misaligned with real-world video streams where information arrives sequentially. Motivated by the streaming nature of video data, we investigate two streaming reasoning paradigms for LVLMs. The first, an interleaved paradigm, alternates between receiving frames and producing partial reasoning but remains constrained by strictly ordered cache updates. To better match streaming inputs, we propose Think-as-You-See (TaYS), a unified framework enabling true concurrent reasoning. TaYS integrates parallelized CoT generation, stream-constrained training, and stream-parallel inference. It further employs temporally aligned reasoning units, streaming attention masks and positional encodings, and a dual KV-cache that decouples visual encoding from textual reasoning. We evaluate all paradigms on the Qwen2.5-VL family across representative video CoT tasks, including event dynamics analysis, causal reasoning, and thematic understanding. Experiments show that TaYS consistently outperforms both batch and interleaved baselines, improving reasoning performance while substantially reducing time-to-first-token (TTFT) and overall reasoning delay. These results demonstrate the effectiveness of data-aligned streaming reasoning in enabling efficient and responsive video understanding for LVLMs. We release our code at https://github.com/EIT-NLP/StreamingLLM/tree/main/TaYS.

Tags

LVLM, Chain-of-Thought, video streams, parallel reasoning

arXiv Categories

cs.CV