Multimodal Learning relevance: 9/10

JEPA-VLA: Video Predictive Embedding is Needed for VLA Models

Shangchen Miao, Ningya Feng, Jialong Wu, Ye Lin, Xu He, Dong Li, Mingsheng Long
arXiv: 2602.11832v1 Published: 2026-02-12 Updated: 2026-02-12

AI Summary

The paper proposes JEPA-VLA, which incorporates video predictive embeddings to improve the performance and generalization of VLA models on robotic manipulation tasks.

Key Contributions

  • Identifies the limitations of the visual representations used in existing VLA models
  • Proposes JEPA-VLA, which integrates video predictive embeddings
  • Shows experimentally that JEPA-VLA improves performance across multiple benchmarks

Methodology

JEPA-VLA adaptively integrates video predictive embeddings (from V-JEPA 2) into existing VLA models to strengthen environment understanding and provide an effective policy prior.
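The summary does not spell out how the "adaptive" integration is implemented. Below is a minimal PyTorch sketch of one plausible form: V-JEPA 2 features are projected into the VLA's visual token space and mixed in through a learned per-token gate. The module name, the gating design, the feature dimensions, and the assumption that V-JEPA 2 tokens are aligned one-to-one with the VLA's visual tokens are all illustrative assumptions, not the authors' actual architecture.

```python
import torch
import torch.nn as nn


class AdaptiveEmbeddingFusion(nn.Module):
    """Hypothetical sketch of adaptively fusing predictive embeddings into a VLA.

    Assumptions (not from the paper): a frozen V-JEPA 2 encoder produces one
    predictive embedding per visual token, and a sigmoid gate decides per token
    how much predictive information to add to the VLA's visual features.
    """

    def __init__(self, vla_dim: int, jepa_dim: int):
        super().__init__()
        # Project V-JEPA 2 features into the VLA's visual token space.
        self.proj = nn.Linear(jepa_dim, vla_dim)
        # Gate conditioned on both streams, one weight per token and channel.
        self.gate = nn.Sequential(nn.Linear(2 * vla_dim, vla_dim), nn.Sigmoid())

    def forward(self, vla_tokens: torch.Tensor, jepa_tokens: torch.Tensor) -> torch.Tensor:
        # vla_tokens:  (B, N, vla_dim)  visual tokens from the VLA's pretrained encoder
        # jepa_tokens: (B, N, jepa_dim) predictive embeddings from frozen V-JEPA 2
        jepa_proj = self.proj(jepa_tokens)
        g = self.gate(torch.cat([vla_tokens, jepa_proj], dim=-1))
        # Residual, adaptively weighted fusion keeps the original tokens intact.
        return vla_tokens + g * jepa_proj


# Toy usage with random tensors standing in for real encoder outputs.
fusion = AdaptiveEmbeddingFusion(vla_dim=1024, jepa_dim=1280)
vla_tokens = torch.randn(2, 256, 1024)
jepa_tokens = torch.randn(2, 256, 1280)
fused = fusion(vla_tokens, jepa_tokens)
print(fused.shape)  # torch.Size([2, 256, 1024])
```

A residual, gated formulation is one simple way to let the policy fall back to the original VLM features when the predictive embedding is uninformative; the paper's own mechanism may differ.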

Original Abstract

Recent vision-language-action (VLA) models built upon pretrained vision-language models (VLMs) have achieved significant improvements in robotic manipulation. However, current VLAs still suffer from low sample efficiency and limited generalization. This paper argues that these limitations are closely tied to an overlooked component, the pretrained visual representation, which offers insufficient knowledge for both environment understanding and the policy prior. Through an in-depth analysis, we find that commonly used visual representations in VLAs, whether pretrained via language-image contrastive learning or image-based self-supervised learning, remain inadequate at capturing crucial, task-relevant environment information and at inducing effective policy priors, i.e., anticipatory knowledge of how the environment evolves under successful task execution. In contrast, we discover that predictive embeddings pretrained on videos, in particular V-JEPA 2, are adept at flexibly discarding unpredictable environment factors and encoding task-relevant temporal dynamics, thereby effectively compensating for key shortcomings of existing visual representations in VLAs. Building on these observations, we introduce JEPA-VLA, a simple yet effective approach that adaptively integrates predictive embeddings into existing VLAs. Our experiments demonstrate that JEPA-VLA yields substantial performance gains across a range of benchmarks, including LIBERO, LIBERO-plus, RoboTwin2.0, and real-robot tasks.

Tags

VLA, Robotic Manipulation, Video Predictive Embedding, Visual Representation

arXiv Categories

cs.CV cs.RO