Multimodal Learning Relevance: 9/10

VLA-JEPA: Enhancing Vision-Language-Action Model with Latent World Model

Jingwen Sun, Wenyao Zhang, Zekun Qi, Shaojie Ren, Zezhi Liu, Hanxin Zhu, Guangzhong Sun, Xin Jin, Zhibo Chen
arXiv: 2602.10098v1 Published: 2026-02-10 Updated: 2026-02-10

AI Summary

VLA-JEPA improves the generalization and robustness of vision-language-action models through leakage-free state prediction.

Key Contributions

  • Proposes the VLA-JEPA pretraining framework, which sidesteps the bias of objectives anchored to pixel variation.
  • Introduces leakage-free state prediction, using latent representations of future frames solely as supervision targets.
  • Simplifies training into a two-stage recipe, removing the multi-stage complexity of prior pipelines.

Methodology

Adopts JEPA-style pretraining: a target encoder produces latent representations of future frames as supervision targets, while the student network observes only the current information and learns to predict those targets in latent space.
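The leakage-free design above can be sketched with a toy linear model. Everything here (dimensions, linear "encoders", learning rate, EMA rate) is an illustrative assumption rather than the paper's architecture; the sketch only shows the mechanism: the future frame enters the target branch as a supervision signal, the student branch sees only the current observation, and the target encoder tracks the student via an exponential moving average, a common JEPA-style choice assumed here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, not from the paper.
OBS_DIM, LATENT_DIM = 16, 8

# Toy linear "encoders": student, target (a frozen EMA copy), and a predictor head.
W_student = rng.normal(size=(LATENT_DIM, OBS_DIM)) * 0.1
W_target = W_student.copy()   # target starts as a copy of the student
W_pred = np.eye(LATENT_DIM)   # predictor maps student latent -> target latent

def jepa_step(obs_t, obs_t1, lr=1e-2, ema=0.99):
    """One leakage-free step: the future frame obs_t1 builds the
    supervision target only and is never fed to the student."""
    global W_student, W_target, W_pred
    target = W_target @ obs_t1          # latent target from the future frame (no gradient)
    z = W_student @ obs_t               # student sees only the current observation
    err = W_pred @ z - target           # prediction error in latent space, not pixel space
    loss = 0.5 * float(err @ err)
    # Manual gradients of the linear pipeline; the target branch stays frozen.
    grad_pred = np.outer(err, z)
    grad_student = np.outer(W_pred.T @ err, obs_t)
    W_pred -= lr * grad_pred
    W_student -= lr * grad_student
    # Target encoder slowly tracks the student via exponential moving average.
    W_target = ema * W_target + (1 - ema) * W_student
    return loss

obs_t = rng.normal(size=OBS_DIM)
obs_t1 = obs_t + 0.1 * rng.normal(size=OBS_DIM)   # toy "future" frame
losses = [jepa_step(obs_t, obs_t1) for _ in range(50)]
```

After a few dozen steps the latent-space prediction error shrinks even though no pixel reconstruction is ever computed, which is the point of predicting in latent rather than pixel space.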

Original Abstract

Pretraining Vision-Language-Action (VLA) policies on internet-scale video is appealing, yet current latent-action objectives often learn the wrong thing: they remain anchored to pixel variation rather than action-relevant state transitions, making them vulnerable to appearance bias, nuisance motion, and information leakage. We introduce VLA-JEPA, a JEPA-style pretraining framework that sidesteps these pitfalls by design. The key idea is leakage-free state prediction: a target encoder produces latent representations from future frames, while the student pathway sees only the current observation -- future information is used solely as supervision targets, never as input. By predicting in latent space rather than pixel space, VLA-JEPA learns dynamics abstractions that are robust to camera motion and irrelevant background changes. This yields a simple two-stage recipe -- JEPA pretraining followed by action-head fine-tuning -- without the multi-stage complexity of prior latent-action pipelines. Experiments on LIBERO, LIBERO-Plus, SimplerEnv and real-world manipulation tasks show that VLA-JEPA achieves consistent gains in generalization and robustness over existing methods.

Tags

Vision-Language-Action Models  Unsupervised Learning  State Prediction  Pretraining

arXiv Categories

cs.RO cs.CV