PyVision-RL: Forging Open Agentic Vision Models via RL
AI Summary
PyVision-RL proposes a reinforcement learning framework that addresses the problem of interaction collapse in multimodal agents, improving tool use and multi-turn reasoning capabilities.
Main Contributions
- Proposes the PyVision-RL framework, which stabilizes training and sustains agent interaction
- Combines an oversampling-filtering-ranking rollout strategy with an accumulative tool reward to prevent interaction collapse
- Develops PyVision-Image and PyVision-Video for image and video understanding
- Proposes on-demand context construction, significantly reducing visual token usage
Methodology
Trains open-weight multimodal models with reinforcement learning, using the rollout strategy and tool reward to encourage multi-turn interaction and on-demand visual processing.
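The paper does not publish the exact selection logic, but the described oversampling-filtering-ranking rollout strategy and accumulative tool reward can be sketched as follows. All function names, reward coefficients, and the rollout dictionary fields (`tool_calls`, `task_reward`) are illustrative assumptions, not the authors' implementation:

```python
def accumulative_tool_reward(num_tool_calls, per_call_bonus=0.1, cap=0.5):
    """Hypothetical accumulative tool reward: a small bonus accrues with
    each tool call, capped so the agent cannot farm reward by calling
    tools indefinitely."""
    return min(num_tool_calls * per_call_bonus, cap)

def select_rollouts(rollouts, k):
    """Oversample-filter-rank sketch: from an oversampled pool of
    rollouts, filter out trajectories with no tool use (the collapse
    mode), rank the remainder by task reward plus the accumulative
    tool bonus, and keep the top-k for the policy update."""
    filtered = [r for r in rollouts if r["tool_calls"] > 0]
    for r in filtered:
        r["total_reward"] = r["task_reward"] + accumulative_tool_reward(r["tool_calls"])
    return sorted(filtered, key=lambda r: r["total_reward"], reverse=True)[:k]
```

The key design idea, as described in the abstract, is that zero-tool-use trajectories are removed before ranking, so the policy gradient never reinforces the collapsed behavior.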
Original Abstract
Reinforcement learning for agentic multimodal models often suffers from interaction collapse, where models learn to reduce tool usage and multi-turn reasoning, limiting the benefits of agentic behavior. We introduce PyVision-RL, a reinforcement learning framework for open-weight multimodal models that stabilizes training and sustains interaction. Our approach combines an oversampling-filtering-ranking rollout strategy with an accumulative tool reward to prevent collapse and encourage multi-turn tool use. Using a unified training pipeline, we develop PyVision-Image and PyVision-Video for image and video understanding. For video reasoning, PyVision-Video employs on-demand context construction, selectively sampling task-relevant frames during reasoning to significantly reduce visual token usage. Experiments show strong performance and improved efficiency, demonstrating that sustained interaction and on-demand visual processing are critical for scalable multimodal agents.
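The abstract's "on-demand context construction" — selectively sampling task-relevant frames during reasoning rather than encoding the whole video — might look like the minimal sketch below. The function name, the timestamp-based request interface, and the fixed `fps` are assumptions for illustration; the paper's actual interface is not specified here:

```python
def sample_frames_on_demand(total_frames, requested_timestamps, fps=30):
    """Hypothetical on-demand frame sampler: the agent requests only the
    timestamps (in seconds) it deems task-relevant during reasoning.
    Requests are mapped to frame indices, clamped to the video length,
    deduplicated, and returned in temporal order."""
    indices = {min(int(t * fps), total_frames - 1) for t in requested_timestamps}
    return sorted(indices)
```

Because only the returned frames are encoded into visual tokens, the context cost scales with the number of agent requests rather than the video length, which is the efficiency gain the abstract claims.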