Multimodal Learning (relevance: 9/10)

Learning Self-Correction in Vision-Language Models via Rollout Augmentation

Yi Ding, Ziliang Qiu, Bolian Li, Ruqi Zhang
arXiv: 2602.08503v1 Published: 2026-02-09 Updated: 2026-02-09

AI Summary

This paper proposes a method for learning self-correction in vision-language models via rollout augmentation, achieving state-of-the-art results across multiple benchmarks.

Key Contributions

  • Proposes correction-specific rollouts (Octopus), an RL rollout augmentation framework that synthesizes dense self-correction examples
  • Introduces a response-masking strategy that decouples self-correction from direct reasoning
  • Builds Octopus-8B, a reasoning VLM with controllable self-correction capability

Methodology

Dense self-correction examples are synthesized by recombining existing rollouts, and a response-masking strategy separates self-correction from direct reasoning during reinforcement learning.
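The recombination step can be illustrated with a minimal sketch. Note this is an assumption-laden reconstruction from the summary, not the paper's actual implementation: the `Rollout` structure, the `TRIGGER` phrase, and the pairing rule (each incorrect response joined with a correct response to the same prompt) are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Rollout:
    prompt: str
    response: str
    correct: bool

# Hypothetical correction-trigger phrase; the paper's actual template
# is not specified in this summary.
TRIGGER = "\nWait, let me re-check my previous answer.\n"

def synthesize_correction_examples(rollouts):
    """Recombine existing rollouts into dense self-correction examples:
    pair each incorrect response with a correct response to the same
    prompt, forming (wrong attempt -> trigger -> corrected answer)."""
    by_prompt = {}
    for r in rollouts:
        group = by_prompt.setdefault(r.prompt, {"good": [], "bad": []})
        group["good" if r.correct else "bad"].append(r)

    examples = []
    for prompt, group in by_prompt.items():
        for bad in group["bad"]:
            for good in group["good"]:
                # Context ends with the wrong attempt plus the trigger;
                # the target is the corrected answer to be learned.
                examples.append({
                    "context": prompt + bad.response + TRIGGER,
                    "target": good.response,
                })
    return examples
```

Because every synthesized pair reuses rollouts already sampled during RL, no extra model generations are needed, which is consistent with the sample-efficiency claim in the abstract.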

Original Abstract

Self-correction is essential for solving complex reasoning problems in vision-language models (VLMs). However, existing reinforcement learning (RL) methods struggle to learn it, as effective self-correction behaviors emerge only rarely, making learning signals extremely sparse. To address this challenge, we propose correction-specific rollouts (Octopus), an RL rollout augmentation framework that synthesizes dense self-correction examples by recombining existing rollouts. This augmentation simultaneously improves sample efficiency due to rollout reuse and stabilizes RL optimization through balanced supervision. Furthermore, we introduce a response-masking strategy that decouples self-correction from direct reasoning, avoiding signal conflicts and enabling both behaviors to be learned effectively. Building on this, we introduce Octopus-8B, a reasoning VLM with controllable self-correction capability. Across 7 benchmarks, it achieves SoTA performance among open-source VLMs, outperforming the best RLVR baseline by 1.0 score while requiring only $0.72\times$ training time per step.
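One plausible reading of the response-masking strategy is a token-level loss mask over a synthesized example: the injected first response (and the trigger) is excluded from the training signal, so gradients flow only through the correction segment. This is a sketch under that assumption; the function name and segment layout are illustrative, not from the paper.

```python
def response_mask(prompt_len, wrong_len, trigger_len, corr_len):
    """Build a per-token loss mask for one synthesized example
    (0 = excluded from the loss, 1 = supervised)."""
    return ([0] * prompt_len    # prompt tokens: never trained
          + [0] * wrong_len     # injected wrong response: masked out
          + [0] * trigger_len   # correction trigger: masked out
          + [1] * corr_len)     # correction tokens: supervised
```

Masking the injected response keeps the model from being penalized for (or anchored to) text it did not choose in this context, which is one way the conflicting signals between direct reasoning and self-correction could be decoupled.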

Tags

Vision-Language Model · Reinforcement Learning · Self-Correction · Rollout Augmentation

arXiv Categories

cs.CV cs.CL cs.LG