Multimodal Learning Relevance: 9/10

VLGOR: Visual-Language Knowledge Guided Offline Reinforcement Learning for Generalizable Agents

Pengsen Liu, Maosen Zeng, Nan Tang, Kaiyuan Li, Jing-Cheng Pang, Yunan Liu, Yang Yu
arXiv: 2603.22892v1 Published: 2026-03-24 Updated: 2026-03-24

AI Summary

VLGOR leverages vision-language knowledge to generate imaginary rollouts that augment offline reinforcement learning, improving agents' ability to generalize.

Key Contributions

  • Proposes the VLGOR framework, which integrates visual and language knowledge
  • Uses a vision-language model to predict future states and actions
  • Employs counterfactual prompts to generate more diverse rollouts

Methodology

A vision-language model is fine-tuned to generate temporally and spatially consistent rollouts, and counterfactual prompts are used to increase the diversity of the generated data, which is then used for offline reinforcement learning.
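The data-augmentation loop described above can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: `vlm_predict_rollout` is a hypothetical stand-in for the fine-tuned vision-language model (here a trivial deterministic stub so the example runs), and the transition format, dynamics, and action selection are all placeholder assumptions.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Transition:
    state: tuple
    action: int
    next_state: tuple

def vlm_predict_rollout(obs, instruction, horizon=3):
    # Hypothetical stand-in for the fine-tuned VLM: given an initial
    # observation and a language instruction, predict a short sequence
    # of (state, action, next_state) transitions. Real dynamics and
    # action prediction are replaced by trivial placeholders here.
    rollout = []
    state = obs
    for t in range(horizon):
        action = hash((instruction, t)) % 4      # placeholder action choice
        next_state = (state[0] + 1, state[1])    # placeholder dynamics
        rollout.append(Transition(state, action, next_state))
        state = next_state
    return rollout

def augment_dataset(dataset, obs, instruction, counterfactuals):
    # One imaginary rollout for the original instruction, plus one per
    # counterfactual prompt; all transitions are appended to the offline
    # dataset that the RL agent later trains on.
    for prompt in [instruction] + list(counterfactuals):
        dataset.extend(vlm_predict_rollout(obs, prompt))
    return dataset

data: List[Transition] = []
augment_dataset(data, (0, 0), "pick up the red block",
                counterfactuals=["pick up the blue block"])
print(len(data))  # 2 prompts x horizon 3 = 6 transitions
```

The counterfactual prompts act purely as additional conditioning inputs to the rollout generator, so dataset diversity grows linearly with the number of prompts without any extra environment interaction.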

Original Abstract

Combining Large Language Models (LLMs) with Reinforcement Learning (RL) enables agents to interpret language instructions more effectively for task execution. However, LLMs typically lack direct perception of the physical environment, which limits their understanding of environmental dynamics and their ability to generalize to unseen tasks. To address this limitation, we propose Visual-Language Knowledge-Guided Offline Reinforcement Learning (VLGOR), a framework that integrates visual and language knowledge to generate imaginary rollouts, thereby enriching the interaction data. The core premise of VLGOR is to fine-tune a vision-language model to predict future states and actions conditioned on an initial visual observation and high-level instructions, ensuring that the generated rollouts remain temporally coherent and spatially plausible. Furthermore, we employ counterfactual prompts to produce more diverse rollouts for offline RL training, enabling the agent to acquire knowledge that facilitates following language instructions while grounding in environments based on visual cues. Experiments on robotic manipulation benchmarks demonstrate that VLGOR significantly improves performance on unseen tasks requiring novel optimal policies, achieving a success rate over 24% higher than the baseline methods.

Tags

Vision-Language, Reinforcement Learning, Offline Learning, Robotic Manipulation

arXiv Category

cs.LG