PyVision-RL: Forging Open Agentic Vision Models via RL
AI Summary
PyVision-RL proposes a reinforcement learning framework that addresses the problem of interaction collapse in multimodal agents, improving tool use and multi-turn reasoning capabilities.
Main Contributions
- Proposes the PyVision-RL framework, which stabilizes training and sustains agent interaction
- Combines an oversampling-filtering-ranking rollout strategy with an accumulative tool reward to prevent interaction collapse
- Develops PyVision-Image and PyVision-Video for image and video understanding
- Proposes on-demand context construction, significantly reducing visual token usage
Methodology
Trains open-weight multimodal models with reinforcement learning, using the rollout strategy and tool reward to encourage multi-turn interaction and on-demand visual processing.
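The paper does not publish the exact selection logic, but the described oversampling-filtering-ranking rollout strategy and accumulative tool reward can be sketched as follows. All function names, reward coefficients, and the rollout dictionary fields (`tool_calls`, `task_reward`) are illustrative assumptions, not the authors' implementation:

```python
def accumulative_tool_reward(num_tool_calls, per_call_bonus=0.1, cap=0.5):
    """Hypothetical accumulative tool reward: a small bonus accrues with
    each tool call, capped so the agent cannot farm reward by calling
    tools indefinitely."""
    return min(num_tool_calls * per_call_bonus, cap)

def select_rollouts(rollouts, k):
    """Oversample-filter-rank sketch: from an oversampled pool of
    rollouts, filter out trajectories with no tool use (the collapse
    mode), rank the remainder by task reward plus the accumulative
    tool bonus, and keep the top-k for the policy update."""
    filtered = [r for r in rollouts if r["tool_calls"] > 0]
    for r in filtered:
        r["total_reward"] = r["task_reward"] + accumulative_tool_reward(r["tool_calls"])
    return sorted(filtered, key=lambda r: r["total_reward"], reverse=True)[:k]
```

The key design idea, as described in the abstract, is that zero-tool-use trajectories are removed before ranking, so the policy gradient never reinforces the collapsed behavior.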
Original Abstract
Reinforcement learning for agentic multimodal models often suffers from interaction collapse, where models learn to reduce tool usage and multi-turn reasoning, limiting the benefits of agentic behavior. We introduce PyVision-RL, a reinforcement learning framework for open-weight multimodal models that stabilizes training and sustains interaction. Our approach combines an oversampling-filtering-ranking rollout strategy with an accumulative tool reward to prevent collapse and encourage multi-turn tool use. Using a unified training pipeline, we develop PyVision-Image and PyVision-Video for image and video understanding. For video reasoning, PyVision-Video employs on-demand context construction, selectively sampling task-relevant frames during reasoning to significantly reduce visual token usage. Experiments show strong performance and improved efficiency, demonstrating that sustained interaction and on-demand visual processing are critical for scalable multimodal agents.
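The abstract's "on-demand context construction" — selectively sampling task-relevant frames during reasoning rather than encoding the whole video — might look like the minimal sketch below. The function name, the timestamp-based request interface, and the fixed `fps` are assumptions for illustration; the paper's actual interface is not specified here:

```python
def sample_frames_on_demand(total_frames, requested_timestamps, fps=30):
    """Hypothetical on-demand frame sampler: the agent requests only the
    timestamps (in seconds) it deems task-relevant during reasoning.
    Requests are mapped to frame indices, clamped to the video length,
    deduplicated, and returned in temporal order."""
    indices = {min(int(t * fps), total_frames - 1) for t in requested_timestamps}
    return sorted(indices)
```

Because only the returned frames are encoded into visual tokens, the context cost scales with the number of agent requests rather than the video length, which is the efficiency gain the abstract claims.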