NovaPlan: Zero-Shot Long-Horizon Manipulation via Closed-Loop Video Language Planning
AI Summary
NovaPlan achieves zero-shot long-horizon robot manipulation through closed-loop video-language planning.
Key Contributions
- Proposes the NovaPlan framework, which unifies VLM planning with geometrically grounded robot execution
- Extracts object keypoints and human hand poses from generated videos as kinematic priors
- Achieves zero-shot long-horizon manipulation with autonomous error recovery
Methodology
The VLM planner decomposes the task into sub-goals and monitors execution in a closed loop; object keypoints and hand-pose information extracted from the generated videos then drive the robot's low-level actions.
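The closed-loop structure described above (decompose, execute, monitor, re-plan on failure) can be sketched as follows. This is a hypothetical illustration, not the paper's implementation: `vlm_decompose` and `execute_subgoal` are stand-in stubs for the actual VLM and low-level controller, and the retry limit is an assumed parameter.

```python
from dataclasses import dataclass

@dataclass
class SubGoal:
    description: str
    done: bool = False

def vlm_decompose(task: str) -> list[SubGoal]:
    # Stub standing in for a VLM call: split a task string into sub-goals.
    return [SubGoal(s.strip()) for s in task.split(",")]

def execute_subgoal(goal: SubGoal, attempt: int) -> bool:
    # Stub for low-level execution and VLM success monitoring; here any
    # "insert" sub-goal fails on its first attempt, forcing a re-plan.
    return not ("insert" in goal.description and attempt == 0)

def closed_loop_plan(task: str, max_retries: int = 2) -> list[str]:
    """Execute each sub-goal, re-planning on single-step failures."""
    log = []
    for goal in vlm_decompose(task):
        for attempt in range(max_retries + 1):
            if execute_subgoal(goal, attempt):
                goal.done = True
                log.append(f"ok: {goal.description}")
                break
            log.append(f"retry: {goal.description}")
        if not goal.done:
            raise RuntimeError(f"failed: {goal.description}")
    return log
```

The key design point this sketch mirrors is that failure of one sub-goal triggers a local re-plan rather than aborting the whole task.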
Original Abstract
Solving long-horizon tasks requires robots to integrate high-level semantic reasoning with low-level physical interaction. While vision-language models (VLMs) and video generation models can decompose tasks and imagine outcomes, they often lack the physical grounding necessary for real-world execution. We introduce NovaPlan, a hierarchical framework that unifies closed-loop VLM and video planning with geometrically grounded robot execution for zero-shot long-horizon manipulation. At the high level, a VLM planner decomposes tasks into sub-goals and monitors robot execution in a closed loop, enabling the system to recover from single-step failures through autonomous re-planning. To compute low-level robot actions, we extract and utilize both task-relevant object keypoints and human hand poses as kinematic priors from the generated videos, and employ a switching mechanism to choose the better one as a reference for robot actions, maintaining stable execution even under heavy occlusion or depth inaccuracy. We demonstrate the effectiveness of NovaPlan on three long-horizon tasks and the Functional Manipulation Benchmark (FMB). Our results show that NovaPlan can perform complex assembly tasks and exhibit dexterous error recovery behaviors without any prior demonstrations or training. Project page: https://nova-plan.github.io/
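The abstract mentions a switching mechanism that chooses between object keypoints and human hand poses as the reference for robot actions. The paper does not spell out the selection rule, so the following is only a minimal sketch under an assumed confidence-based criterion: keypoint confidence is discounted by an occlusion estimate, and the hand-pose prior is preferred when keypoints are heavily occluded or their depth is unreliable. All names and the margin parameter are hypothetical.

```python
def choose_reference(keypoint_conf: float, hand_pose_conf: float,
                     occlusion: float, conf_margin: float = 0.1) -> str:
    """Pick the kinematic prior with the higher effective confidence.

    keypoint_conf / hand_pose_conf: detector confidences in [0, 1]
    occlusion: estimated fraction of keypoints occluded, in [0, 1]
    conf_margin: hysteresis margin favoring the hand-pose fallback
    """
    # Penalize the keypoint prior when the object is occluded.
    effective_kp = keypoint_conf * (1.0 - occlusion)
    if effective_kp >= hand_pose_conf + conf_margin:
        return "keypoints"
    return "hand_pose"
```

With low occlusion the keypoint prior wins; under heavy occlusion the same detector scores flip the choice to the hand-pose prior, which matches the abstract's claim of stable execution under occlusion or depth inaccuracy.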