Universal Pose Pretraining for Generalizable Vision-Language-Action Policies
AI Summary
Pose-VLA improves the generalization and training efficiency of VLA models on robotic tasks through decoupling and pretraining.
Main Contributions
- Proposes the Pose-VLA decoupled paradigm, separating spatial-prior learning from embodiment-specific action alignment
- Introduces discrete pose tokens as a universal representation, unifying 3D data and robot trajectories
- Achieves state-of-the-art or competitive performance on RoboTwin 2.0 and LIBERO
Methodology
Two-stage pretraining: first establish spatial grounding via poses, then perform motion alignment through trajectory supervision.
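The core idea of discrete pose tokens can be sketched as quantizing a continuous pose into token ids drawn from a shared vocabulary. The snippet below is a minimal, hypothetical illustration only: the paper's actual binning scheme, workspace ranges, vocabulary size, and rotation parameterization are not specified here, so `N_BINS`, `RANGES`, and the per-dimension token-block layout are all assumptions.

```python
import numpy as np

N_BINS = 256  # assumed per-dimension vocabulary size (illustrative)

# Assumed camera-centric ranges for (x, y, z, roll, pitch, yaw); illustrative only.
RANGES = np.array([
    [-1.0, 1.0],      # x (m)
    [-1.0, 1.0],      # y (m)
    [0.0, 2.0],       # z (m)
    [-np.pi, np.pi],  # roll (rad)
    [-np.pi, np.pi],  # pitch (rad)
    [-np.pi, np.pi],  # yaw (rad)
])

def pose_to_tokens(pose):
    """Quantize a 6-DoF pose into per-dimension discrete token ids."""
    pose = np.asarray(pose, dtype=np.float64)
    lo, hi = RANGES[:, 0], RANGES[:, 1]
    # Normalize to [0, 1], clip out-of-range values, then bin.
    norm = np.clip((pose - lo) / (hi - lo), 0.0, 1.0)
    bins = np.minimum((norm * N_BINS).astype(int), N_BINS - 1)
    # Offset each dimension into its own token-id block so the
    # model sees one flat shared vocabulary of 6 * N_BINS ids.
    return bins + np.arange(6) * N_BINS

def tokens_to_pose(token_ids):
    """Invert the quantization, recovering bin centers (lossy)."""
    bins = np.asarray(token_ids) - np.arange(6) * N_BINS
    lo, hi = RANGES[:, 0], RANGES[:, 1]
    return lo + (bins + 0.5) / N_BINS * (hi - lo)

toks = pose_to_tokens([0.1, -0.2, 0.5, 0.0, 0.3, -1.0])
recon = tokens_to_pose(toks)
```

Because both 3D datasets and robot demonstrations can be mapped into this kind of shared discrete vocabulary, the same token space can carry supervision from both sources, which is the property the decoupled pretraining relies on.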
Original Abstract
Existing Vision-Language-Action (VLA) models often suffer from feature collapse and low training efficiency because they entangle high-level perception with sparse, embodiment-specific action supervision. Since these models typically rely on VLM backbones optimized for Visual Question Answering (VQA), they excel at semantic identification but often overlook subtle 3D state variations that dictate distinct action patterns. To resolve these misalignments, we propose Pose-VLA, a decoupled paradigm that separates VLA training into a pre-training phase for extracting universal 3D spatial priors in a unified camera-centric space, and a post-training phase for efficient embodiment alignment within robot-specific action space. By introducing discrete pose tokens as a universal representation, Pose-VLA seamlessly integrates spatial grounding from diverse 3D datasets with geometry-level trajectories from robotic demonstrations. Our framework follows a two-stage pre-training pipeline, establishing fundamental spatial grounding via poses followed by motion alignment through trajectory supervision. Extensive evaluations demonstrate that Pose-VLA achieves state-of-the-art results on RoboTwin 2.0 with a 79.5% average success rate and competitive performance on LIBERO at 96.0%. Real-world experiments further showcase robust generalization across diverse objects using only 100 demonstrations per task, validating the efficiency of our pre-training paradigm.