Multimodal Learning Relevance: 9/10

Learning Transferable Temporal Primitives for Video Reasoning via Synthetic Videos

Songtao Jiang, Sibo Song, Chenyi Zhou, Yuan Wang, Ruizhe Chen, Tongkun Guan, Ruilin Luo, Yan Zhang, Zhihang Tang, Yuchong Sun, Hang Zhang, Zhibo Yang, Shuai Bai, Junyang Lin, Zuozhu Liu

arXiv: 2603.17693v1 Published: 2026-03-18 Updated: 2026-03-18


AI Summary

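Proposes SynRL, a post-training framework that uses synthetic videos to learn transferable temporal primitives and improve video reasoning.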

Key Contributions

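Proposes the SynRL framework for learning temporal primitives
Trains on programmatically generated synthetic videos, which is cost-efficient
Achieves substantial improvements across 15 benchmarks spanning temporal grounding, complex reasoning, and general video understanding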

Methodology

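Programmatically generates synthetic videos to build datasets of short-term and long-term temporal primitives (7.7K CoT and 7K RL samples with ground-truth frame-level annotations), which are then used for training.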
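
The following is a minimal sketch of what code-based clip generation with frame-level ground truth could look like for a short-term "direction" primitive. It is not the authors' pipeline: the names `Sample`, `render_square`, and `make_direction_sample`, the grid representation, the default sizes, and the question wording are all hypothetical, chosen only to illustrate why programmatic generation yields exact labels for free.

```python
from dataclasses import dataclass
from typing import List, Tuple

Frame = List[List[int]]  # 0 = background, 1 = shape pixel (assumed toy format)


@dataclass
class Sample:
    frames: List[Frame]
    positions: List[Tuple[int, int]]  # top-left (x, y) of the square in each frame
    question: str
    answer: str


def render_square(size: int, x: int, y: int, side: int) -> Frame:
    """Draw a filled square on an empty size x size grid, clipping at the borders."""
    frame = [[0] * size for _ in range(size)]
    for row in range(y, y + side):
        for col in range(x, x + side):
            if 0 <= row < size and 0 <= col < size:
                frame[row][col] = 1
    return frame


def make_direction_sample(size: int = 32, side: int = 4, num_frames: int = 8,
                          start: Tuple[int, int] = (2, 14),
                          velocity: Tuple[int, int] = (3, 0)) -> Sample:
    """Generate one clip of a square moving at constant velocity along one axis."""
    vx, vy = velocity
    if (vx == 0) == (vy == 0):
        raise ValueError("velocity must be non-zero along exactly one axis")
    # Positions come straight from the generator, so the annotations are exact.
    positions = [(start[0] + vx * t, start[1] + vy * t) for t in range(num_frames)]
    frames = [render_square(size, x, y, side) for x, y in positions]
    if vx != 0:
        answer = "right" if vx > 0 else "left"
    else:
        answer = "down" if vy > 0 else "up"  # image rows grow downward
    question = "In which direction does the square move?"
    return Sample(frames, positions, question, answer)
```

Because the answer is derived from the same parameters that drive the rendering, the label cannot disagree with the clip, and the per-frame `positions` list plays the role of a frame-level annotation in this toy setting.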

Original Abstract

The transition from image to video understanding requires vision-language models (VLMs) to shift from recognizing static patterns to reasoning over temporal dynamics such as motion trajectories, speed changes, and state transitions. Yet current post-training methods fall short due to two critical limitations: (1) existing datasets often lack temporal-centricity, where answers can be inferred from isolated keyframes rather than requiring holistic temporal integration; and (2) training data generated by proprietary models contains systematic errors in fundamental temporal perception, such as confusing motion directions or misjudging speeds. We introduce SynRL, a post-training framework that teaches models temporal primitives, the fundamental building blocks of temporal understanding including direction, speed, and state tracking. Our key insight is that these abstract primitives, learned from programmatically generated synthetic videos, transfer effectively to real-world scenarios. We decompose temporal understanding into short-term perceptual primitives (speed, direction) and long-term cognitive primitives, constructing 7.7K CoT and 7K RL samples with ground-truth frame-level annotations through code-based video generation. Despite training on simple geometric shapes, SynRL achieves substantial improvements across 15 benchmarks spanning temporal grounding, complex reasoning, and general video understanding. Remarkably, our 7.7K synthetic CoT samples outperform Video-R1 with 165K real-world samples. We attribute this to fundamental temporal skills, such as tracking frame by frame changes and comparing velocity, that transfer effectively from abstract synthetic patterns to complex real-world scenarios. This establishes a new paradigm for video post-training: video temporal learning through carefully designed synthetic data provides a more cost efficient scaling path.

arXiv Categories

cs.CV
