Multimodal Learning Relevance: 9/10

Guided Verifier: Collaborative Multimodal Reasoning via Dynamic Process Supervision

Lingzhuang Sun, Ruitong Liu, Yuxia Zhu, Xiaohan Xu, Jingxuan Wei, Xiangxiang Zhang, Bihui Yu, Wentao Zhang
arXiv: 2602.04290v1 Published: 2026-02-04 Updated: 2026-02-04

AI Summary

Proposes the Guided Verifier framework, in which a dynamic verifier supervises the MLLM's reasoning process in real time, reducing error propagation and improving reasoning ability.

Key Contributions

  • Proposes the Guided Verifier framework, enabling dynamic process supervision
  • Constructs the CoRe dataset, targeting multimodal hallucinations, to train the verifier
  • Experiments demonstrate that the method effectively improves MLLM performance on multimodal reasoning tasks

Methodology

Introduces a dynamic verifier that co-solves problems alongside the policy model, detecting inconsistencies in real time and providing directional signals; the verifier is trained on synthetic data.
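The co-solving loop described above can be sketched as follows. This is a minimal illustration of the control flow only, assuming a per-step verify-then-retry protocol; all function and class names (`guided_rollout`, `toy_policy`, `toy_verifier`) are hypothetical and not from the paper.

```python
# Hypothetical sketch of verifier-guided rollout: the policy proposes one
# reasoning step at a time; the verifier checks each step and, when it
# detects an inconsistency, returns a directional hint that steers a retry.
from dataclasses import dataclass, field

@dataclass
class Trajectory:
    steps: list = field(default_factory=list)

def guided_rollout(policy_step, verify_step, question, max_steps=8):
    """Roll out a reasoning trajectory with per-step verification."""
    traj = Trajectory()
    hint = None
    for _ in range(max_steps):
        step = policy_step(question, traj.steps, hint)   # propose next step
        ok, hint = verify_step(question, traj.steps, step)  # check the step
        if ok:
            traj.steps.append(step)
            hint = None
            if step.endswith("ANSWER"):   # toy termination marker
                break
        # on failure the step is discarded and the hint guides a retry
    return traj

# Toy policy/verifier showing the interaction, not real models:
def toy_policy(q, steps, hint):
    if hint:
        return "corrected step ANSWER"
    return "flawed step" if not steps else "step ANSWER"

def toy_verifier(q, steps, step):
    if "flawed" in step:
        return False, "Recheck the figure before concluding."
    return True, None

result = guided_rollout(toy_policy, toy_verifier, "What is shown?")
print(result.steps)  # → ['corrected step ANSWER']
```

In this toy run the verifier rejects the first (flawed) step and its hint redirects the policy, mirroring how early logical deviations would be caught before they cascade.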

Original Abstract

Reinforcement Learning (RL) has emerged as a pivotal mechanism for enhancing the complex reasoning capabilities of Multimodal Large Language Models (MLLMs). However, prevailing paradigms typically rely on solitary rollout strategies where the model works alone. This lack of intermediate oversight renders the reasoning process susceptible to error propagation, where early logical deviations cascade into irreversible failures, resulting in noisy optimization signals. In this paper, we propose the Guided Verifier framework to address these structural limitations. Moving beyond passive terminal rewards, we introduce a dynamic verifier that actively co-solves tasks alongside the policy. During the rollout phase, this verifier interacts with the policy model in real-time, detecting inconsistencies and providing directional signals to steer the model toward valid trajectories. To facilitate this, we develop a specialized data synthesis pipeline targeting multimodal hallucinations, constructing the CoRe dataset of process-level negatives and Correct-guide Reasoning trajectories to train the guided verifier. Extensive experiments on MathVista, MathVerse and MMMU indicate that by allocating compute to collaborative inference and dynamic verification, an 8B-parameter model can achieve strong performance.

Tags

Multimodal Reasoning Reinforcement Learning Verification

arXiv Categories

cs.CL