LaMP: Learning Vision-Language-Action Policies with 3D Scene Flow as Latent Motion Prior
AI Summary
LaMP leverages 3D scene flow as a motion prior to improve vision-language-action policies for robotic manipulation tasks.
Key Contributions
- Proposes the LaMP framework, which integrates vision, language, and action and uses 3D scene flow as a latent motion prior.
- Designs a Motion Expert and an Action Expert whose information is fused through gated cross-attention.
- Outperforms existing VLA models on multiple benchmarks and improves robustness to OOD perturbations.
Methodology
LaMP uses its Motion Expert to predict 3D scene flow and passes the resulting representation to the Action Expert as prior information that guides action prediction; a minimal sketch of this fusion is given below.
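The paper does not include reference code, so the following is a hedged PyTorch sketch of how such gated cross-attention fusion might look. The module name `GatedCrossAttentionFusion`, the tensor shapes, and the zero-initialized tanh gate are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch (not the authors' code) of gated cross-attention that
# fuses Motion Expert hidden states into the Action Expert's token stream.
import torch
import torch.nn as nn

class GatedCrossAttentionFusion(nn.Module):
    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        # Learnable scalar gate, initialized to zero so training starts from
        # the ungated action pathway and gradually admits motion information.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, action_tokens: torch.Tensor,
                motion_hidden: torch.Tensor) -> torch.Tensor:
        # action_tokens: (B, T_a, dim) queries from the Action Expert
        # motion_hidden: (B, T_m, dim) keys/values from the Motion Expert
        attended, _ = self.attn(self.norm(action_tokens),
                                motion_hidden, motion_hidden)
        # tanh-gated residual: output equals action_tokens at initialization
        return action_tokens + torch.tanh(self.gate) * attended
```

The zero-initialized gate is a common choice for this kind of conditioning (e.g. in Flamingo-style gated cross-attention) because it preserves the pretrained action pathway at the start of training; whether LaMP uses this exact initialization is an assumption here.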
Original Abstract
We introduce LaMP, a dual-expert Vision-Language-Action framework that embeds dense 3D scene flow as a latent motion prior for robotic manipulation. Existing VLA models regress actions directly from 2D semantic visual features, forcing them to learn complex 3D physical interactions implicitly; this implicit learning strategy degrades under unfamiliar spatial dynamics. LaMP addresses this limitation by aligning a flow-matching Motion Expert with a policy-predicting Action Expert through gated cross-attention. Specifically, the Motion Expert generates a one-step partially denoised 3D scene flow, and its hidden states condition the Action Expert without full multi-step reconstruction. We evaluate LaMP on the LIBERO, LIBERO-Plus, and SimplerEnv-WidowX simulation benchmarks as well as in real-world experiments. LaMP consistently outperforms the evaluated VLA baselines across all three benchmarks, achieving the highest reported average success rates under the same training budgets. On LIBERO-Plus OOD perturbations, LaMP shows improved robustness, with an average 9.7% gain over the strongest prior baseline. Our project page is available at https://summerwxk.github.io/lamp-project-page/.
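The abstract's "one-step partially denoised 3D scene flow" suggests taking a single Euler step of a flow-matching ODE from noise and reusing the network's hidden states as conditioning, instead of running the full multi-step sampler. The sketch below illustrates that idea under assumed shapes; `FlowMatchingMotionExpert`, `one_step_denoise`, the toy velocity network, and the step size are hypothetical, and the real Motion Expert would also condition on visual and language features.

```python
# A hedged sketch of one-step partial denoising with flow matching: the model
# predicts a velocity field v_theta(x_t, t), and one Euler step from noise
# yields a coarse scene-flow estimate plus hidden states for conditioning.
import torch
import torch.nn as nn

class FlowMatchingMotionExpert(nn.Module):
    def __init__(self, flow_dim: int = 3, hidden: int = 256):
        super().__init__()
        # Toy per-point velocity network over (x_t, t); illustrative only.
        self.net = nn.Sequential(
            nn.Linear(flow_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
        )
        self.head = nn.Linear(hidden, flow_dim)

    def forward(self, x_t: torch.Tensor, t: torch.Tensor):
        t_feat = t.expand(*x_t.shape[:-1], 1)       # broadcast time to points
        h = self.net(torch.cat([x_t, t_feat], -1))  # hidden states for fusion
        return self.head(h), h

@torch.no_grad()
def one_step_denoise(expert: FlowMatchingMotionExpert,
                     num_points: int = 1024, dt: float = 0.5):
    """Single Euler step of the flow ODE dx/dt = v_theta(x, t) from noise."""
    x0 = torch.randn(1, num_points, 3)              # noise sample at t = 0
    v, hidden = expert(x0, torch.zeros(1))
    x_partial = x0 + dt * v                         # partially denoised flow
    return x_partial, hidden                        # hidden conditions policy
```

Under this reading, the Action Expert consumes `hidden` through the gated cross-attention shown earlier, so the policy benefits from the motion prior without paying for full multi-step scene-flow reconstruction at inference time.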