Towards Unified Multimodal Interleaved Generation via Group Relative Policy Optimization
AI Summary
Proposes a reinforcement-learning-based post-training strategy that improves the multimodal interleaved generation capability of unified vision-language models.
Key Contributions
- Proposes a reinforcement-learning-based post-training strategy that requires no large-scale multimodal interleaved dataset.
- Introduces a unified policy optimization framework that extends Group Relative Policy Optimization (GRPO) to the multimodal setting.
- Designs a hybrid reward mechanism covering textual relevance, visual-text alignment, and structural fidelity, complemented by process-level rewards.
Methodology
A hybrid dataset is used for a warm-up stage; the model is then optimized with reinforcement learning through the extended GRPO framework, guided by the hybrid outcome rewards and process-level rewards.
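The core of GRPO is normalizing each sampled trajectory's reward against its own sampling group rather than against a learned value function. A minimal sketch of that group-relative advantage computation, assuming the rewards for a group of interleaved trajectories have already been scored:

```python
# Sketch of GRPO's group-relative advantage: each trajectory's reward is
# standardized against the group sampled for the same prompt. Function and
# variable names are illustrative, not from the paper's implementation.
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Return (r_i - mean) / (std + eps) for each reward in the group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: four sampled interleaved trajectories for one prompt.
advs = group_relative_advantages([0.2, 0.5, 0.8, 0.5])
```

Trajectories scoring above the group mean receive positive advantages and are reinforced; the extension here is that a single trajectory interleaves text and image tokens, so one advantage weights both modalities jointly.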
Original Abstract
Unified vision-language models have made significant progress in multimodal understanding and generation, yet they largely fall short in producing multimodal interleaved outputs, which is a crucial capability for tasks like visual storytelling and step-by-step visual reasoning. In this work, we propose a reinforcement learning-based post-training strategy to unlock this capability in existing unified models, without relying on large-scale multimodal interleaved datasets. We begin with a warm-up stage using a hybrid dataset comprising curated interleaved sequences and limited data for multimodal understanding and text-to-image generation, which exposes the model to interleaved generation patterns while preserving its pretrained capabilities. To further refine interleaved generation, we propose a unified policy optimization framework that extends Group Relative Policy Optimization (GRPO) to the multimodal setting. Our approach jointly models text and image generation within a single decoding trajectory and optimizes it with our novel hybrid rewards covering textual relevance, visual-text alignment, and structural fidelity. Additionally, we incorporate process-level rewards to provide step-wise guidance, enhancing training efficiency in complex multimodal tasks. Experiments on MMIE and InterleavedBench demonstrate that our approach significantly enhances the quality and coherence of multimodal interleaved generation.