Multimodal Learning Relevance: 9/10

Seeing with You: Perception-Reasoning Coevolution for Multimodal Reasoning

Ziqi Miao, Haonan Jia, Lijun Li, Chen Qian, Yuan Xiong, Wenting Yan, Jing Shao
arXiv: 2603.28618v1 Published: 2026-03-30 Updated: 2026-03-30

AI Summary

PRCO decouples the optimization of perception and reasoning via dual-role reinforcement learning, improving multimodal reasoning performance.

Key Contributions

  • Proposes the PRCO framework, which decouples the optimization objectives of perception and reasoning
  • Designs two cooperative roles, an Observer and a Solver, responsible for extracting evidence and predicting answers, respectively
  • Introduces role-specific reward signals, using the Solver's downstream success to guide the Observer

Methodology

Dual-role reinforcement learning with a shared policy: the Observer generates a question-tailored evidence caption, the Solver predicts the answer from that caption, and both roles are trained jointly with role-specific rewards.
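The role-specific reward assignment described above can be sketched minimally. This is an illustrative reading of the scheme, not the paper's implementation: the function names, the exact-match check, and the use of the mean Solver success as the Observer's utility reward are all assumptions.

```python
def solver_reward(predicted_answer: str, gold_answer: str) -> float:
    """Verifiable outcome reward: 1.0 iff the Solver's final answer
    matches the gold answer (exact match is an illustrative choice)."""
    return 1.0 if predicted_answer.strip().lower() == gold_answer.strip().lower() else 0.0


def observer_reward(solver_rewards: list[float]) -> float:
    """Utility reward for the Observer: average downstream success of the
    Solver rollouts that consumed this Observer's evidence caption."""
    return sum(solver_rewards) / len(solver_rewards) if solver_rewards else 0.0


# One Observer caption, three Solver rollouts conditioned on it:
gold = "42"
solver_outputs = ["42", "42", "7"]
r_solver = [solver_reward(a, gold) for a in solver_outputs]
r_observer = observer_reward(r_solver)
print(r_solver, round(r_observer, 3))  # two of three correct
```

The key point the sketch captures is that the Observer never sees the answer directly; its reward is derived entirely from how useful its caption proved to the Solver.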

Original Abstract

Reinforcement learning with verifiable rewards (RLVR) has substantially enhanced the reasoning capabilities of multimodal large language models (MLLMs). However, existing RLVR approaches typically rely on outcome-driven optimization that updates both perception and reasoning using a shared reward based solely on the final answer. This shared reward blurs credit assignment, frequently improving reasoning patterns while failing to reliably enhance the accuracy of upstream visual evidence extraction. To address this perception bottleneck, we introduce PRCO (Perception-Reasoning Coevolution), a dual-role RLVR framework with a shared policy. PRCO consists of two cooperative roles: an Observer that generates an evidence caption tailored to the question and a Solver that predicts the final answer based on this caption. Crucially, PRCO employs role-specific reward signals: the Solver is optimized using verifiable outcome rewards on the final answer, while the Observer receives a utility reward derived from the Solver's downstream success. Extensive experiments across eight challenging multimodal reasoning benchmarks demonstrate that PRCO yields consistent improvements across model scales by over 7 points on average accuracy compared to the base model, outperforming prior open-source RL-tuned baselines.

Tags

Multimodal Learning, Reinforcement Learning, Reasoning, Vision-Language, VQA

arXiv Categories

cs.AI