KineVLA: Towards Kinematics-Aware Vision-Language-Action Models with Bi-Level Action Decomposition
AI Summary
Proposes the KineVLA framework, which achieves understanding and execution of fine-grained kinematic instructions through bi-level action decomposition, and validates its advantages on both simulated and real-robot datasets.
Main Contributions
- Proposes a novel kinematics-rich VLA task
- Proposes the KineVLA framework, which decouples goal-level invariance from kinematics-level variability
- Constructs kinematics-aware VLA datasets spanning both simulated and real-world robotic platforms
Methodology
Adopts a bi-level action representation and bi-level reasoning tokens that serve as explicitly supervised intermediate variables, aligning language with action to enable fine-grained motion control; a minimal sketch follows.
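To make the bi-level decoupling concrete, here is a minimal PyTorch sketch of how goal-level and kinematics-level reasoning tokens might be produced from fused vision-language features and then combined into an action prediction. All names, token counts, and dimensions (`BiLevelActionDecoder`, `num_goal_tokens`, `hidden_dim`, etc.) are hypothetical illustrations based on the paper's description, not KineVLA's actual implementation.

```python
# Hypothetical sketch of bi-level action decomposition; module names,
# dimensions, and token counts are illustrative assumptions, not KineVLA code.
import torch
import torch.nn as nn

class BiLevelActionDecoder(nn.Module):
    """Decouples goal-level invariance from kinematics-level variability.

    Goal tokens encode the invariant task goal; kinematics tokens encode
    instruction-specific execution attributes (direction, trajectory,
    orientation, relative displacement). Both act as explicit, supervisable
    intermediate variables between language and low-level actions.
    """

    def __init__(self, hidden_dim=512, num_goal_tokens=4,
                 num_kine_tokens=8, action_dim=7):
        super().__init__()
        # Learned queries that attend over fused vision-language features.
        self.goal_queries = nn.Parameter(torch.randn(num_goal_tokens, hidden_dim))
        self.kine_queries = nn.Parameter(torch.randn(num_kine_tokens, hidden_dim))
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads=8,
                                                batch_first=True)
        # Action head conditioned on both levels of reasoning tokens.
        self.action_head = nn.Sequential(
            nn.Linear(hidden_dim * (num_goal_tokens + num_kine_tokens), hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim),
        )

    def forward(self, vl_features):
        # vl_features: (batch, seq_len, hidden_dim) fused vision-language tokens.
        batch = vl_features.size(0)
        queries = torch.cat([self.goal_queries, self.kine_queries], dim=0)
        queries = queries.unsqueeze(0).expand(batch, -1, -1)
        tokens, _ = self.cross_attn(queries, vl_features, vl_features)
        goal_tokens, kine_tokens = tokens.split(
            [self.goal_queries.size(0), self.kine_queries.size(0)], dim=1)
        # Goal/kinematics tokens can each receive auxiliary supervision
        # (the bi-level annotations) before being fused into the action.
        action = self.action_head(tokens.flatten(1))
        return action, goal_tokens, kine_tokens

# Usage example with random features standing in for a VLM backbone's output:
# decoder = BiLevelActionDecoder()
# action, goal_tok, kine_tok = decoder(torch.randn(2, 64, 512))
```

The design point this sketch illustrates is that the two token groups give separate hooks for supervision: goal tokens can be trained to stay constant across instruction variants of the same task, while kinematics tokens absorb the trajectory-level differences.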
Original Abstract
In this paper, we introduce a novel kinematics-rich vision-language-action (VLA) task in which language commands densely encode diverse kinematic attributes (such as direction, trajectory, orientation, and relative displacement) at key moments from initiation through completion. Unlike existing action instructions, which capture kinematics only coarsely or partially, this supports fine-grained and personalized manipulation. In this setting, task goals remain invariant while execution trajectories must adapt to instruction-level kinematic specifications. To address this challenge, we propose KineVLA, a vision-language-action framework that explicitly decouples goal-level invariance from kinematics-level variability through a bi-level action representation and bi-level reasoning tokens, which serve as explicit, supervised intermediate variables aligning language and action. To support this task, we construct kinematics-aware VLA datasets spanning both simulation and real-world robotic platforms, featuring instruction-level kinematic variations and bi-level annotations. Extensive experiments on LIBERO and a Realman-75 robot demonstrate that KineVLA consistently outperforms strong VLA baselines on kinematics-sensitive benchmarks, achieving more precise, controllable, and generalizable manipulation behaviors.