AI Agents Relevance: 8/10

EVA: Aligning Video World Models with Executable Robot Actions via Inverse Dynamics Rewards

Ruixiang Wang, Qingming Liu, Yueci Deng, Guiliang Liu, Zhen Liu, Kui Jia
arXiv: 2603.17808v1 Published: 2026-03-18 Updated: 2026-03-18

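AI Summary

EVA aligns video world models with executable robot actions through inverse dynamics rewards, reducing inconsistencies when generated rollouts are executed as actions.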

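Key Contributions

  • Proposes the Executable Video Alignment (EVA) framework
  • Uses an inverse dynamics model as a reward to evaluate the quality of generated videos
  • Improves robot task execution success and reduces artifacts in generated rollouts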

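Methodology

An inverse dynamics model is trained on real robot trajectories and then used as a reward model to fine-tune the video world model with reinforcement learning.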

原文摘要

Video generative models are increasingly used as world models for robotics, where a model generates a future visual rollout conditioned on the current observation and task instruction, and an inverse dynamics model (IDM) converts the generated frames into executable robot actions. However, current video world models lack explicit executability constraints. As a result, visually coherent rollouts may still violate rigid-body and kinematic consistency, producing unstable or infeasible control commands when decoded by an IDM. We refer to this mismatch between visual generation and physically executable control as the executability gap. While this gap can be mitigated at inference time using techniques such as rejection sampling, such approaches are inefficient due to the high cost of video generation. In this paper, we leverage the executability gap as a training signal and introduce Executable Video Alignment (EVA), a reinforcement-learning post-training framework for aligning video world models. EVA trains an inverse dynamics model on real robot trajectories and repurposes it as a reward model that evaluates generated videos through the action sequences they induce, encouraging smooth motions measured by velocity, acceleration, and jerk while penalizing actions that violate embodiment constraints. Importantly, the reward remains informative even when generated videos contain severe visual artifacts, since such artifacts typically translate into unstable or out-of-bound actions. Experiments on the RoboTwin benchmark and a real bimanual robot show that EVA reduces embodiment-specific artifacts in generated rollouts and improves downstream task execution success.
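
The abstract describes a reward that favors smooth decoded actions (measured by velocity, acceleration, and jerk) and penalizes actions that violate embodiment constraints. The sketch below illustrates one way such a score could be computed from an action sequence that an IDM has already decoded. It is not the paper's formulation: the function name `executability_reward`, the finite-difference estimates, the box-shaped bounds `lower`/`upper`, the time step `dt`, and all weights are assumptions chosen for illustration.

```python
import numpy as np


def executability_reward(actions, dt=0.1, lower=-1.0, upper=1.0,
                         w_vel=1.0, w_acc=1.0, w_jerk=1.0, w_bound=10.0):
    """Score a decoded action sequence; higher means smoother and in-bounds.

    actions: array of shape (T, D), T time steps of D-dimensional actions.
    All weights, bounds, and dt are hypothetical, not values from the paper.
    """
    actions = np.asarray(actions, dtype=float)
    if actions.ndim != 2 or actions.shape[0] < 4:
        raise ValueError("expected shape (T, D) with T >= 4")

    # Finite-difference estimates of the first three time derivatives.
    vel = np.diff(actions, n=1, axis=0) / dt
    acc = np.diff(actions, n=2, axis=0) / dt**2
    jerk = np.diff(actions, n=3, axis=0) / dt**3

    # Smoothness cost: weighted mean squared velocity, acceleration, jerk.
    smooth_cost = (w_vel * np.mean(vel**2)
                   + w_acc * np.mean(acc**2)
                   + w_jerk * np.mean(jerk**2))

    # Bound cost: mean distance by which actions leave [lower, upper].
    overshoot = (np.maximum(actions - upper, 0.0)
                 + np.maximum(lower - actions, 0.0))
    bound_cost = w_bound * np.mean(overshoot)

    # Reward is the negative total cost.
    return -float(smooth_cost + bound_cost)


# Usage: a gentle ramp should outscore a jittery or out-of-bound variant.
t = np.linspace(0.0, 1.0, 20)[:, None]
ramp = 0.5 * t
jitter = ramp + 0.2 * (-1.0) ** np.arange(20)[:, None]
shifted = ramp + 2.0  # same shape as the ramp, but outside the bounds
print(executability_reward(ramp) > executability_reward(jitter))
print(executability_reward(ramp) > executability_reward(shifted))
```

Under these assumptions, frame-to-frame artifacts in a generated video would show up as jitter in the decoded actions and lower the score, which matches the abstract's remark that the reward stays informative when rollouts contain severe visual artifacts.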

Tags

Robotics, Video generative models, Inverse dynamics, Reinforcement learning

arXiv Categories

cs.RO cs.AI