Multimodal Learning relevance: 9/10

Towards Generalizable Robotic Manipulation in Dynamic Environments

Heng Fang, Shangru Li, Shuhan Wang, Xuanyang Xi, Dingkang Liang, Xiang Bai
arXiv: 2603.15620v1 Published: 2026-03-16 Updated: 2026-03-16

AI Summary

The paper introduces the DOMINO dataset and the PUMA model to improve the robotic manipulation capabilities of VLA models in dynamic environments.

Key Contributions

  • Constructed DOMINO, a large-scale dynamic-manipulation dataset
  • Proposed PUMA, a dynamics-aware VLA architecture
  • Validated that training on dynamic data transfers to static tasks

Methodology

The authors build a dynamic-manipulation dataset and, drawing on historical optical flow and learnable world queries, design PUMA, a VLA model that can predict short-horizon future states.
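The coupling of history-aware perception and short-horizon prediction can be sketched as a cross-attention step: query tokens attend over a context built from current observation features and historical optical-flow features. This is a minimal illustrative sketch, not the paper's implementation; all token counts, dimensions, and the use of random stand-ins for learned parameters are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
dim, n_obs, n_flow, n_queries = 64, 16, 12, 8

# History-aware context: current observation tokens concatenated with
# scene-centric historical optical-flow tokens (random stand-ins here).
obs_tokens = rng.standard_normal((n_obs, dim))
flow_tokens = rng.standard_normal((n_flow, dim))
context = np.concatenate([obs_tokens, flow_tokens], axis=0)   # (28, 64)

# "World queries" (learned embeddings in a real model) cross-attend to
# the context to implicitly forecast object-centric future states.
world_queries = rng.standard_normal((n_queries, dim))
attn = softmax(world_queries @ context.T / np.sqrt(dim))      # (8, 28)
future_states = attn @ context                                # (8, 64)
print(future_states.shape)
```

Each row of `attn` is a distribution over observation and flow tokens, so every predicted future-state embedding is a history-weighted mixture of the scene context.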

Original Abstract

Vision-Language-Action (VLA) models excel in static manipulation but struggle in dynamic environments with moving targets. This performance gap primarily stems from a scarcity of dynamic manipulation datasets and the reliance of mainstream VLAs on single-frame observations, restricting their spatiotemporal reasoning capabilities. To address this, we introduce DOMINO, a large-scale dataset and benchmark for generalizable dynamic manipulation, featuring 35 tasks with hierarchical complexities, over 110K expert trajectories, and a multi-dimensional evaluation suite. Through comprehensive experiments, we systematically evaluate existing VLAs on dynamic tasks, explore effective training strategies for dynamic awareness, and validate the generalizability of dynamic data. Furthermore, we propose PUMA, a dynamics-aware VLA architecture. By integrating scene-centric historical optical flow and specialized world queries to implicitly forecast object-centric future states, PUMA couples history-aware perception with short-horizon prediction. Results demonstrate that PUMA achieves state-of-the-art performance, yielding a 6.3% absolute improvement in success rate over baselines. Moreover, we show that training on dynamic data fosters robust spatiotemporal representations that transfer to static tasks. All code and data are available at https://github.com/H-EmbodVis/DOMINO.

Tags

Robotic Manipulation · Dynamic Environments · Vision-Language-Action Models · Datasets

arXiv Categories

cs.CV cs.RO