AI Agents relevance: 8/10

SaPaVe: Towards Active Perception and Manipulation in Vision-Language-Action Models for Robotics

Mengzhen Liu, Enshen Zhou, Cheng Chi, Yi Han, Shanyu Rong, Liming Chen, Pengwei Wang, Zhongyuan Wang, Shanghang Zhang
arXiv: 2603.12193v1 Published: 2026-03-12 Updated: 2026-03-12

AI Summary

The SaPaVe framework decouples perception from manipulation to improve the performance of robotic vision-language-action models on active manipulation tasks.

Key Contributions

  • Proposes the SaPaVe framework, which decouples camera actions from arm actions
  • Builds the ActiveViewPose-200K dataset for learning semantic camera control
  • Designs the ActiveManip-Bench benchmark for evaluating active manipulation ability

Methodology

Adopts a bottom-up training strategy: first train semantic camera control alone, then jointly optimize camera and arm actions, with a 3D geometry-aware module improving robustness under changing viewpoints.
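The bottom-up schedule can be illustrated with a deliberately tiny sketch. Everything below is our own illustration, not the paper's implementation: a shared feature vector feeds two decoupled linear heads (`W_cam` for camera actions, `W_arm` for arm actions); stage 1 fits only the camera head, and stage 2 updates both heads jointly, mirroring the "camera first, then joint optimization" strategy.

```python
import numpy as np

# Hypothetical sketch of a bottom-up, decoupled-head training schedule.
# All names and dimensions are illustrative assumptions, not SaPaVe's actual design.

rng = np.random.default_rng(0)
D, CAM, ARM, N = 16, 3, 7, 256           # feature dim, camera/arm action dims, samples

X = rng.normal(size=(N, D))               # stand-in for shared vision-language features
Y_cam = X @ rng.normal(size=(D, CAM))     # synthetic camera-movement targets
Y_arm = X @ rng.normal(size=(D, ARM))     # synthetic arm-action targets

W_cam = np.zeros((D, CAM))                # decoupled camera-action head
W_arm = np.zeros((D, ARM))                # decoupled arm-action head
lr = 0.05

def mse(W, Y):
    """Mean squared error of a linear head on the shared features."""
    return float(np.mean((X @ W - Y) ** 2))

# Stage 1: train semantic camera control only; the arm head stays frozen.
for _ in range(200):
    W_cam -= lr * (2 * X.T @ (X @ W_cam - Y_cam) / N)

# Stage 2: jointly optimize both decoupled heads on the hybrid objective.
for _ in range(200):
    W_cam -= lr * (2 * X.T @ (X @ W_cam - Y_cam) / N)
    W_arm -= lr * (2 * X.T @ (X @ W_arm - Y_arm) / N)

print(round(mse(W_cam, Y_cam), 6), round(mse(W_arm, Y_arm), 6))
```

The point of the decoupling is visible in the code: each head has its own parameters and its own gradient, so the camera policy learned in stage 1 is not overwritten when arm learning begins, only refined.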

Original Abstract

Active perception and manipulation are crucial for robots to interact with complex scenes. Existing methods struggle to unify semantic-driven active perception with robust, viewpoint-invariant execution. We propose SaPaVe, an end-to-end framework that jointly learns these capabilities in a data-efficient manner. Our approach decouples camera and manipulation actions rather than placing them in a shared action space, and follows a bottom-up training strategy: we first train semantic camera control on a large-scale dataset, then jointly optimize both action types using hybrid data. To support this framework, we introduce ActiveViewPose-200K, a dataset of 200k image-language-camera movement pairs for semantic camera movement learning, and a 3D geometry-aware module that improves execution robustness under dynamic viewpoints. We also present ActiveManip-Bench, the first benchmark for evaluating active manipulation beyond fixed-view settings. Extensive experiments in both simulation and real-world environments show that SaPaVe outperforms recent vision-language-action models such as GR00T N1 and π_0, achieving up to 31.25% higher success rates in real-world tasks. These results show that tightly coupled perception and execution, when trained with decoupled yet coordinated strategies, enable efficient and generalizable active manipulation. Project page: https://lmzpai.github.io/SaPaVe

Tags

Robotics, Vision-Language-Action Models, Active Perception, Active Manipulation

arXiv Categories

cs.RO cs.CV