Multimodal Learning Relevance: 9/10

Towards Unified Multimodal Interleaved Generation via Group Relative Policy Optimization

Ming Nie, Chunwei Wang, Jianhua Han, Hang Xu, Li Zhang
arXiv: 2603.09538v1 Published: 2026-03-10 Updated: 2026-03-10

AI Summary

Proposes a reinforcement learning-based post-training strategy that improves the multimodal interleaved generation capability of unified vision-language models.

Key Contributions

  • A reinforcement learning-based post-training strategy that requires no large-scale multimodal interleaved datasets.
  • A unified policy optimization framework that extends Group Relative Policy Optimization (GRPO) to the multimodal setting.
  • A hybrid reward mechanism covering textual relevance, visual-text alignment, and structural fidelity, complemented by process-level rewards.
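The hybrid reward described above can be pictured as a weighted combination of the three component scores. The weights and the assumption that each component lies in [0, 1] are illustrative choices for this sketch, not details from the paper:

```python
def hybrid_reward(text_relevance: float,
                  visual_text_alignment: float,
                  structural_fidelity: float,
                  weights: tuple = (0.4, 0.4, 0.2)) -> float:
    """Combine three reward components (each assumed in [0, 1]) into a
    single scalar reward via a weighted sum. Weights are hypothetical."""
    w_text, w_align, w_struct = weights
    return (w_text * text_relevance
            + w_align * visual_text_alignment
            + w_struct * structural_fidelity)
```

With the default weights, a trajectory scoring perfectly on all three components receives a reward of 1.0, while weaker alignment or structure proportionally lowers the total.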

Methodology

The model is first warmed up on a hybrid dataset, then optimized with reinforcement learning through an extended GRPO framework that uses the hybrid rewards together with process-level rewards.
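At the core of GRPO-style optimization is the group-relative advantage: several trajectories are sampled per prompt, and each trajectory's reward is normalized against its own group. A minimal sketch, assuming the standard GRPO normalization (the paper's multimodal extension adds joint text-image decoding on top of this):

```python
import statistics


def group_relative_advantages(rewards: list) -> list:
    """Compute GRPO-style advantages for one group of sampled trajectories:
    A_i = (r_i - mean(r)) / (std(r) + eps).
    Normalizing within the group removes the need for a learned value model."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    eps = 1e-8                        # guard against a zero-variance group
    return [(r - mean) / (std + eps) for r in rewards]
```

Trajectories scoring above the group mean get positive advantages and are reinforced; those below the mean are penalized, so the advantages within a group always sum to roughly zero.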

Original Abstract

Unified vision-language models have made significant progress in multimodal understanding and generation, yet they largely fall short in producing multimodal interleaved outputs, which is a crucial capability for tasks like visual storytelling and step-by-step visual reasoning. In this work, we propose a reinforcement learning-based post-training strategy to unlock this capability in existing unified models, without relying on large-scale multimodal interleaved datasets. We begin with a warm-up stage using a hybrid dataset comprising curated interleaved sequences and limited data for multimodal understanding and text-to-image generation, which exposes the model to interleaved generation patterns while preserving its pretrained capabilities. To further refine interleaved generation, we propose a unified policy optimization framework that extends Group Relative Policy Optimization (GRPO) to the multimodal setting. Our approach jointly models text and image generation within a single decoding trajectory and optimizes it with our novel hybrid rewards covering textual relevance, visual-text alignment, and structural fidelity. Additionally, we incorporate process-level rewards to provide step-wise guidance, enhancing training efficiency in complex multimodal tasks. Experiments on MMIE and InterleavedBench demonstrate that our approach significantly enhances the quality and coherence of multimodal interleaved generation.

Tags

Multimodal Learning Reinforcement Learning Vision-Language Models Interleaved Generation

arXiv Categories

cs.CV