AI Agents relevance: 8/10

Fast-WAM: Do World Action Models Need Test-time Future Imagination?

Tianyuan Yuan, Zibin Dong, Yicheng Liu, Hang Zhao
arXiv: 2603.16666v1 Published: 2026-03-17 Updated: 2026-03-17

AI Summary

By removing future imagination at test time, Fast-WAM achieves a large speedup while showing that the main benefit of video modeling comes from training, not inference.

Key Contributions

  • Proposes Fast-WAM, a WAM architecture that skips future prediction at test time.
  • Shows experimentally that video modeling during training matters far more for performance than future prediction at test time.
  • Fast-WAM achieves results competitive with state-of-the-art methods on multiple benchmarks while significantly reducing latency.

Methodology

Design the Fast-WAM architecture, which retains video co-training during training but skips future prediction at test time, and compare it against other variants in controlled experiments.
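The train/test asymmetry described above can be sketched as follows. This is a minimal illustrative Python sketch, not the paper's implementation: the class and method names (`FastWAM`, `action_head`, `video_head`) and the joint-loss formulation are assumptions made for clarity, with simple stubs standing in for the actual networks.

```python
# Hypothetical sketch of the Fast-WAM idea: a shared world representation
# is co-trained with a video-prediction head, but at test time only the
# action head runs. All names and stub computations are illustrative.

class FastWAM:
    def __init__(self, video_loss_weight=1.0):
        # Weight on the video co-training term (assumed hyperparameter).
        self.video_loss_weight = video_loss_weight

    def encode(self, observation):
        # Shared world representation (stub: identity, for illustration).
        return observation

    def action_head(self, latent):
        # Predict an action directly from the latent (stub computation).
        return [x * 0.5 for x in latent]

    def video_head(self, latent):
        # Predict future observations; used ONLY as a training-time
        # auxiliary task, never invoked during inference.
        return [x + 1.0 for x in latent]

    def training_step(self, observation, target_action, target_future):
        # Joint objective: action loss + weighted video-prediction loss.
        latent = self.encode(observation)
        act_loss = sum((a - t) ** 2
                       for a, t in zip(self.action_head(latent), target_action))
        vid_loss = sum((v - t) ** 2
                       for v, t in zip(self.video_head(latent), target_future))
        return act_loss + self.video_loss_weight * vid_loss

    def act(self, observation):
        # Test time: skip future imagination entirely, so no iterative
        # video denoising sits on the control path.
        return self.action_head(self.encode(observation))
```

Because `act` never touches `video_head`, inference cost is just the encoder plus the action head, which is what enables real-time control in the imagine-free setting; removing `video_head` from training, by contrast, changes the learned representation.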

Original Abstract

World Action Models (WAMs) have emerged as a promising alternative to Vision-Language-Action (VLA) models for embodied control because they explicitly model how visual observations may evolve under action. Most existing WAMs follow an imagine-then-execute paradigm, incurring substantial test-time latency from iterative video denoising, yet it remains unclear whether explicit future imagination is actually necessary for strong action performance. In this paper, we ask whether WAMs need explicit future imagination at test time, or whether their benefit comes primarily from video modeling during training. We disentangle the role of video modeling during training from explicit future generation during inference by proposing Fast-WAM, a WAM architecture that retains video co-training during training but skips future prediction at test time. We further instantiate several Fast-WAM variants to enable a controlled comparison of these two factors. Across these variants, we find that Fast-WAM remains competitive with imagine-then-execute variants, while removing video co-training causes a much larger performance drop. Empirically, Fast-WAM achieves competitive results with state-of-the-art methods both on simulation benchmarks (LIBERO and RoboTwin) and real-world tasks, without embodied pretraining. It runs in real time with 190ms latency, over 4× faster than existing imagine-then-execute WAMs. These results suggest that the main value of video prediction in WAMs may lie in improving world representations during training rather than generating future observations at test time. Project page: https://yuantianyuan01.github.io/FastWAM/

Tags

World Action Models · Embodied Control · Video Prediction · Real-time Inference

arXiv Categories

cs.CV cs.AI