AI Agents Relevance: 9/10

PyVision-RL: Forging Open Agentic Vision Models via RL

Shitian Zhao, Shaoheng Lin, Ming Li, Haoquan Zhang, Wenshuo Peng, Kaipeng Zhang, Chen Wei
arXiv: 2602.20739v1 Published: 2026-02-24 Updated: 2026-02-24

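AI Summary

PyVision-RL is a reinforcement learning framework that addresses interaction collapse in multimodal agents and improves their tool use and multi-turn reasoning.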

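Key Contributions

  • Proposes the PyVision-RL framework, which stabilizes training and sustains agent interaction
  • Combines an oversampling-filtering-ranking rollout strategy with an accumulative tool reward to prevent interaction collapse
  • Develops PyVision-Image and PyVision-Video for image and video understanding
  • Proposes on-demand context construction, which significantly reduces visual token usage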

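Methodology

Trains open-weight multimodal models with reinforcement learning, using a rollout strategy and a tool reward to encourage multi-turn interaction and on-demand visual processing.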
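
The summary names an oversampling-filtering-ranking rollout strategy and an accumulative tool reward but gives no formulas, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. The `Rollout` structure, the reward shape (a per-tool-call bonus granted only on correct answers, with hypothetical `bonus` and `cap` values), the filter rule (drop groups whose rewards are all identical), and the ranking key (number of tool calls) are all illustrative choices.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    """One sampled trajectory (hypothetical structure, not from the paper)."""
    prompt_id: str
    correct: bool
    tool_calls: int


def accumulative_tool_reward(r: Rollout, bonus: float = 0.1, cap: int = 5) -> float:
    """Assumed reward shape: 1.0 for a correct answer plus a bonus that
    accumulates with each tool call, up to `cap` calls. Incorrect answers
    earn nothing, so tool use alone is never rewarded."""
    if not r.correct:
        return 0.0
    return 1.0 + bonus * min(r.tool_calls, cap)


def select_rollouts(group: list[Rollout], keep: int) -> list[Rollout]:
    """Oversample-filter-rank for one prompt. `group` is assumed to hold
    more rollouts than the `keep` that will be used for the update."""
    rewards = [accumulative_tool_reward(r) for r in group]
    # Filter (assumption): identical rewards carry no relative signal.
    if len(set(rewards)) <= 1:
        return []
    # Rank (assumption): prefer trajectories with more tool interaction.
    ranked = sorted(group, key=lambda r: r.tool_calls, reverse=True)
    return ranked[:keep]


group = [
    Rollout("q1", True, 3),
    Rollout("q1", False, 0),
    Rollout("q1", True, 1),
    Rollout("q1", False, 2),
]
kept = select_rollouts(group, keep=2)
print([(r.correct, r.tool_calls) for r in kept])
```

In this sketch, a prompt whose sampled rollouts are all wrong (or all identically rewarded) contributes nothing to the batch, and the ranking step biases the retained set toward longer interactions. Whether the paper uses these exact criteria is not stated in the summary.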

Original Abstract

Reinforcement learning for agentic multimodal models often suffers from interaction collapse, where models learn to reduce tool usage and multi-turn reasoning, limiting the benefits of agentic behavior. We introduce PyVision-RL, a reinforcement learning framework for open-weight multimodal models that stabilizes training and sustains interaction. Our approach combines an oversampling-filtering-ranking rollout strategy with an accumulative tool reward to prevent collapse and encourage multi-turn tool use. Using a unified training pipeline, we develop PyVision-Image and PyVision-Video for image and video understanding. For video reasoning, PyVision-Video employs on-demand context construction, selectively sampling task-relevant frames during reasoning to significantly reduce visual token usage. Experiments show strong performance and improved efficiency, demonstrating that sustained interaction and on-demand visual processing are critical for scalable multimodal agents.
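
To illustrate why on-demand context construction can reduce visual token usage, the sketch below compares sampling frames from a whole video with sampling only from a requested time window. It is an assumption-laden illustration rather than the PyVision-Video mechanism: the function names, the even-spacing rule, and the `tokens_per_frame` value of 256 are hypothetical.

```python
def uniform_frame_indices(total_frames: int, n: int) -> list[int]:
    """Baseline: n frame indices spread evenly over the whole video."""
    if n <= 0 or total_frames <= 0:
        return []
    n = min(n, total_frames)
    step = total_frames / n
    return [int(i * step) for i in range(n)]


def on_demand_frame_indices(
    total_frames: int, fps: float, start_s: float, end_s: float, n: int
) -> list[int]:
    """n frame indices taken only from the window [start_s, end_s) that
    the agent asked for during reasoning (hypothetical interface)."""
    first = max(0, int(start_s * fps))
    last = min(total_frames, int(end_s * fps))
    span = last - first
    if span <= 0 or n <= 0:
        return []
    n = min(n, span)
    step = span / n
    return [first + int(i * step) for i in range(n)]


def visual_tokens(num_frames: int, tokens_per_frame: int = 256) -> int:
    """Token cost under an assumed fixed per-frame budget."""
    return num_frames * tokens_per_frame


# A 100-frame clip at 10 fps: dense upfront sampling vs. one targeted request.
dense = uniform_frame_indices(total_frames=100, n=32)
targeted = on_demand_frame_indices(total_frames=100, fps=10, start_s=2, end_s=4, n=4)
print(visual_tokens(len(dense)), visual_tokens(len(targeted)))
```

The saving in this toy setup comes entirely from requesting fewer frames, concentrated where the task needs them. The actual reduction reported by the paper depends on its own sampling policy and encoder, which the abstract does not specify.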

Tags

Reinforcement Learning, Multimodal Learning, AI Agents, Vision-Language Models, Tool Use

arXiv Categories

cs.AI cs.CV