VLA-JEPA: Enhancing Vision-Language-Action Model with Latent World Model
AI Summary
VLA-JEPA uses leakage-free state prediction to improve the generalization and robustness of vision-language-action models.
Key Contributions
- Proposes the VLA-JEPA pretraining framework, which avoids the bias caused by anchoring to pixel variation.
- Introduces leakage-free state prediction, using latent representations of future frames solely as supervision targets.
- Simplifies the training procedure, removing the need for a complex multi-stage pipeline.
Methodology
The method adopts JEPA-style pretraining: a target encoder produces latent representations of future frames as prediction targets, while the student network observes only the current observation (see the sketch below).
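To make the setup concrete, here is a minimal PyTorch sketch of a leakage-free JEPA-style objective. The toy MLP encoder, latent size, EMA rate, and MSE loss are all illustrative assumptions, not the paper's actual architecture; the design point it demonstrates is that future frames flow only through the no-gradient target branch, so future information serves as supervision and never as student input.

```python
# Illustrative sketch of leakage-free state prediction (assumptions, not the paper's code).
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """Toy frame encoder standing in for the vision backbone (assumption)."""
    def __init__(self, in_dim=3 * 64 * 64, latent_dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(in_dim, 512), nn.GELU(),
                                 nn.Linear(512, latent_dim))

    def forward(self, x):
        return self.net(x)

student = Encoder()
predictor = nn.Sequential(nn.Linear(256, 256), nn.GELU(), nn.Linear(256, 256))
target = copy.deepcopy(student)           # target encoder starts as a copy of the student
for p in target.parameters():             # targets are supervision only: no gradients
    p.requires_grad_(False)

def jepa_loss(current_frame, future_frame):
    # Student pathway sees ONLY the current observation; no future frames as input.
    z_pred = predictor(student(current_frame))
    with torch.no_grad():
        z_target = target(future_frame)   # future frame is used solely as the target
    return F.mse_loss(z_pred, z_target)   # predict in latent space, not pixel space

@torch.no_grad()
def ema_update(tau=0.996):
    # Target encoder tracks the student via an exponential moving average (assumed rate).
    for p_t, p_s in zip(target.parameters(), student.parameters()):
        p_t.mul_(tau).add_(p_s, alpha=1.0 - tau)

# One illustrative optimization step on random data.
opt = torch.optim.AdamW(list(student.parameters()) + list(predictor.parameters()), lr=1e-4)
cur, fut = torch.randn(8, 3, 64, 64), torch.randn(8, 3, 64, 64)
loss = jepa_loss(cur, fut)
loss.backward()
opt.step(); opt.zero_grad()
ema_update()
```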
Original Abstract
Pretraining Vision-Language-Action (VLA) policies on internet-scale video is appealing, yet current latent-action objectives often learn the wrong thing: they remain anchored to pixel variation rather than action-relevant state transitions, making them vulnerable to appearance bias, nuisance motion, and information leakage. We introduce VLA-JEPA, a JEPA-style pretraining framework that sidesteps these pitfalls by design. The key idea is *leakage-free state prediction*: a target encoder produces latent representations from future frames, while the student pathway sees only the current observation; future information is used solely as supervision targets, never as input. By predicting in latent space rather than pixel space, VLA-JEPA learns dynamics abstractions that are robust to camera motion and irrelevant background changes. This yields a simple two-stage recipe (JEPA pretraining followed by action-head fine-tuning) without the multi-stage complexity of prior latent-action pipelines. Experiments on LIBERO, LIBERO-Plus, SimplerEnv and real-world manipulation tasks show that VLA-JEPA achieves consistent gains in generalization and robustness over existing methods.
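The two-stage recipe mentioned in the abstract ends with action-head fine-tuning on top of the pretrained encoder. Below is a minimal sketch of what that second stage could look like; the placeholder encoder, the 7-dimensional action space, and the MSE imitation loss are assumptions for illustration, not the paper's actual design.

```python
# Illustrative sketch of stage two: attach an action head to the pretrained encoder.
import torch
import torch.nn as nn
import torch.nn.functional as F

pretrained_encoder = nn.Sequential(          # placeholder for the JEPA-pretrained backbone
    nn.Flatten(), nn.Linear(3 * 64 * 64, 512), nn.GELU(), nn.Linear(512, 256))
action_head = nn.Sequential(                 # newly attached head predicting robot actions
    nn.Linear(256, 256), nn.GELU(), nn.Linear(256, 7))

def imitation_loss(observation, expert_action):
    z = pretrained_encoder(observation)      # latent state from the pretrained encoder
    return F.mse_loss(action_head(z), expert_action)

# One illustrative fine-tuning step on random data.
opt = torch.optim.AdamW(
    list(pretrained_encoder.parameters()) + list(action_head.parameters()), lr=1e-5)
obs, act = torch.randn(8, 3, 64, 64), torch.randn(8, 7)
loss = imitation_loss(obs, act)
loss.backward()
opt.step(); opt.zero_grad()
```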