AI Agents relevance: 8/10

Chain of World: World Model Thinking in Latent Motion

Fuxiang Yang, Donglin Di, Lulu Tang, Xuancheng Zhang, Lei Fan, Hao Li, Chen Wei, Tonghua Su, Baorui Ma
arXiv: 2603.03195v1 Published: 2026-03-03 Updated: 2026-03-03

AI Summary

CoWVLA unifies the temporal reasoning of world models with the compactness of latent actions through a disentangled latent motion representation, improving visuomotor learning.

Main Contributions

  • Proposes the CoWVLA framework, combining the strengths of world models and latent actions
  • Uses a pretrained video VAE to extract structure and motion latent representations
  • Achieves action prediction by jointly modeling sparse keyframes and action sequences
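The structure/motion factorization behind the second contribution can be illustrated with a toy sketch. This is not the paper's video VAE; it is a minimal stand-in (all names hypothetical) showing the idea of splitting a segment into one shared "structure" latent plus a chain of per-transition "motion" latents:

```python
import numpy as np

def factorize_segment(frames):
    """Toy sketch (hypothetical, not the paper's VAE): split a video
    segment into a 'structure' latent shared by the whole segment and
    a chain of 'motion' latents, one per frame-to-frame transition."""
    frames = np.asarray(frames, dtype=np.float64)  # (T, D) flattened frames
    structure = frames[0].copy()        # static content anchor
    motion = np.diff(frames, axis=0)    # (T-1, D) residual dynamics
    return structure, motion

def reconstruct_segment(structure, motion):
    """Inverse of the toy factorization: replaying the motion chain from
    the structure anchor recovers every frame, including the terminal one."""
    return np.concatenate([structure[None],
                           structure[None] + np.cumsum(motion, axis=0)])
```

The point of the split is capacity: the motion chain is far smaller than the frames themselves, so a model predicting it need not re-synthesize redundant background content.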

Methodology

A VAE disentangles video into structure and motion latents; the VLA learns the latent motion chain, then autoregressively decodes keyframes together with action sequences.
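The co-fine-tuning stage models sparse keyframes and action sequences in one autoregressive decoder. A hedged sketch of what such a unified target sequence could look like (token names and layout are assumptions, not taken from the paper):

```python
def build_cofinetune_sequence(instruction_tokens, keyframe_tokens, action_chunks):
    """Sketch of a unified autoregressive target sequence for
    co-fine-tuning (hypothetical layout): the decoder conditions on the
    instruction, then interleaves sparse keyframes with the action
    chunks executed between consecutive keyframes."""
    assert len(keyframe_tokens) == len(action_chunks) + 1, \
        "one keyframe before and after each action chunk"
    seq = list(instruction_tokens) + ["<bos>"]
    for keyframe, actions in zip(keyframe_tokens, action_chunks):
        seq.append(keyframe)   # sparse visual anchor
        seq.extend(actions)    # discrete actions toward the next keyframe
    seq.append(keyframe_tokens[-1])  # terminal keyframe closes the segment
    seq.append("<eos>")
    return seq
```

Interleaving the two token types in one stream is what lets a single next-token objective align the latent visual dynamics with discrete action prediction.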

Original Abstract

Vision-Language-Action (VLA) models are a promising path toward embodied intelligence, yet they often overlook the predictive and temporal-causal structure underlying visual dynamics. World-model VLAs address this by predicting future frames, but waste capacity reconstructing redundant backgrounds. Latent-action VLAs encode frame-to-frame transitions compactly, but lack temporally continuous dynamic modeling and world knowledge. To overcome these limitations, we introduce CoWVLA (Chain-of-World VLA), a new "Chain of World" paradigm that unifies world-model temporal reasoning with a disentangled latent motion representation. First, a pretrained video VAE serves as a latent motion extractor, explicitly factorizing video segments into structure and motion latents. Then, during pre-training, the VLA learns from an instruction and an initial frame to infer a continuous latent motion chain and predict the segment's terminal frame. Finally, during co-fine-tuning, this latent dynamic is aligned with discrete action prediction by jointly modeling sparse keyframes and action sequences in a unified autoregressive decoder. This design preserves the world-model benefits of temporal reasoning and world knowledge while retaining the compactness and interpretability of latent actions, enabling efficient visuomotor learning. Extensive experiments on robotic simulation benchmarks show that CoWVLA outperforms existing world-model and latent-action approaches and achieves moderate computational efficiency, highlighting its potential as a more effective VLA pretraining paradigm. The project website can be found at https://fx-hit.github.io/cowvla-io.

Tags

VLA, World Model, Latent Action, Embodied Intelligence

arXiv Categories

cs.CV cs.AI cs.RO