Multimodal Learning relevance: 9/10

SOLE-R1: Video-Language Reasoning as the Sole Reward for On-Robot Reinforcement Learning

Philip Schroeder, Thomas Weng, Karl Schmeckpeper, Eric Rosen, Stephen Hart, Ondrej Biza
arXiv: 2603.28730v1 Published: 2026-03-30 Updated: 2026-03-30

AI Summary

SOLE-R1 uses a video-language reasoning model as the sole reward signal for on-robot reinforcement learning, removing the need for hand-designed rewards.

Key Contributions

  • Introduces SOLE-R1, a video-language reasoning model that serves as the sole reward signal for robot reinforcement learning.
  • Develops a large-scale video trajectory and reasoning synthesis pipeline that generates temporally grounded chain-of-thought (CoT) traces.
  • Demonstrates zero-shot online reinforcement learning on previously unseen manipulation tasks, including a real-robot setting.

Methodology

Built on a video-language model, SOLE-R1 generates dense rewards via per-timestep spatiotemporal CoT reasoning over raw video observations and a natural-language goal; the model itself is trained with a hybrid framework that couples supervised fine-tuning with RL from verifiable rewards.
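The reward interface described above can be sketched as follows. This is not the authors' code: `query_progress` is a hypothetical stand-in for SOLE-R1 inference, here replaced by a toy stub. The sketch shows the common shaping choice of converting dense per-timestep progress estimates into per-step rewards via progress deltas, so the episode return telescopes to the final progress estimate.

```python
from typing import Callable, List


def query_progress(frames: List[object], goal: str) -> float:
    """Hypothetical stub standing in for the reward model. A real system
    would run spatiotemporal chain-of-thought reasoning over the video and
    return an estimated task progress in [0, 1]. Here we fake monotone
    progress from the frame count, purely for illustration."""
    return min(1.0, 0.1 * len(frames))


class ProgressReward:
    """Wraps a progress-estimating model as an RL reward signal.

    Each step appends the newest observation, queries the model for an
    updated progress estimate, and returns the progress *delta* as the
    dense reward (so the undiscounted return equals final progress)."""

    def __init__(self, goal: str, progress_fn: Callable = query_progress):
        self.goal = goal
        self.progress_fn = progress_fn
        self.frames: List[object] = []
        self.prev_progress = 0.0

    def reset(self) -> None:
        self.frames = []
        self.prev_progress = 0.0

    def step(self, frame: object) -> float:
        self.frames.append(frame)
        progress = self.progress_fn(self.frames, self.goal)
        reward = progress - self.prev_progress  # dense shaped reward
        self.prev_progress = progress
        return reward


# Usage: the RL loop sees only these rewards, no ground-truth signal.
rewarder = ProgressReward(goal="place the cube in the bin")
rewards = [rewarder.step(f"frame_{t}") for t in range(5)]
total = sum(rewards)  # telescopes to the final progress estimate
```

The delta form keeps the reward dense (non-zero feedback at most timesteps) while leaving the optimal policy unchanged relative to rewarding final progress only, which is one plausible way to use per-timestep progress estimates directly as rewards.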

Original Abstract

Vision-language models (VLMs) have shown impressive capabilities across diverse tasks, motivating efforts to leverage these models to supervise robot learning. However, when used as evaluators in reinforcement learning (RL), today's strongest models often fail under partial observability and distribution shift, enabling policies to exploit perceptual errors rather than solve the task. To address this limitation, we introduce SOLE-R1 (Self-Observing LEarner), a video-language reasoning model explicitly designed to serve as the sole reward signal for online RL. Given only raw video observations and a natural-language goal, SOLE-R1 performs per-timestep spatiotemporal chain-of-thought (CoT) reasoning and produces dense estimates of task progress that can be used directly as rewards. To train SOLE-R1, we develop a large-scale video trajectory and reasoning synthesis pipeline that generates temporally grounded CoT traces aligned with continuous progress supervision. This data is combined with foundational spatial and multi-frame temporal reasoning, and used to train the model with a hybrid framework that couples supervised fine-tuning with RL from verifiable rewards. Across four different simulation environments and a real-robot setting, SOLE-R1 enables zero-shot online RL from random initialization: robots learn previously unseen manipulation tasks without ground-truth rewards, success indicators, demonstrations, or task-specific tuning. SOLE-R1 succeeds on 24 unseen tasks and substantially outperforms strong vision-language rewarders, including GPT-5 and Gemini-3-Pro, while exhibiting markedly greater robustness to reward hacking.

Tags

Robot Learning · Reinforcement Learning · Video-Language Models · Chain-of-Thought

arXiv Categories

cs.RO cs.CL cs.CV