Policy-based Tuning of Autoregressive Image Models with Instance- and Distribution-Level Rewards
AI Summary
Proposes a reinforcement-learning-based fine-tuning framework for autoregressive image models that improves both image quality and diversity.
Key Contributions
- Introduces a novel distribution-level Leave-One-Out FID (LOO-FID) reward that encourages sample diversity.
- Combines it with instance-level rewards (CLIP and HPSv2) to preserve semantic and perceptual fidelity.
- Stabilizes the multi-objective learning with an adaptive entropy regularization term.
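The LOO-FID idea can be sketched as follows: each sample in a generated group is rewarded by how much the group's FID against a reference distribution worsens when that sample is left out, so samples that add diversity earn positive reward. This is a minimal illustration, not the paper's exact formulation: it uses a diagonal-covariance FID for brevity, and the EMA decay and the assumption of a fixed feature extractor are mine.

```python
import numpy as np

def diag_fid(mu1, var1, mu2, var2):
    # Frechet distance between two Gaussians with diagonal covariances
    # (a pure-NumPy simplification of full FID, for illustration only)
    return float(np.sum((mu1 - mu2) ** 2)
                 + np.sum(var1 + var2 - 2.0 * np.sqrt(var1 * var2)))

class LooFidReward:
    # Leave-one-out reward: a sample earns positive reward when removing
    # it *raises* FID against the reference, i.e. it was adding diversity.
    # The EMA reference moments and decay rate are assumptions.
    def __init__(self, dim, ema_decay=0.99):
        self.ema_decay = ema_decay
        self.ref_mu = np.zeros(dim)
        self.ref_var = np.ones(dim)

    def update_reference(self, feats):
        # Exponential moving average of reference feature moments
        d = self.ema_decay
        self.ref_mu = d * self.ref_mu + (1.0 - d) * feats.mean(axis=0)
        self.ref_var = d * self.ref_var + (1.0 - d) * feats.var(axis=0)

    def rewards(self, feats):
        # FID of the full group vs. FID with each sample held out
        full = diag_fid(feats.mean(0), feats.var(0), self.ref_mu, self.ref_var)
        out = np.empty(len(feats))
        for i in range(len(feats)):
            rest = np.delete(feats, i, axis=0)
            out[i] = diag_fid(rest.mean(0), rest.var(0),
                              self.ref_mu, self.ref_var) - full
        return out
```

Keeping the reference as running EMA moments (rather than recomputing statistics over a large real-image set every step) is what makes a distribution-level reward cheap enough to evaluate inside each policy update.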
Methodology
Casts token-based autoregressive synthesis as a Markov Decision Process and optimizes it via Group Relative Policy Optimization (GRPO), combining the LOO-FID reward with the instance-level rewards.
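The core GRPO machinery, in general form, is critic-free: sample a group of outputs per prompt, normalize each sample's reward by the group mean and standard deviation to get an advantage, then apply a PPO-style clipped surrogate per token. A minimal sketch (clip threshold and epsilon are standard defaults, not values from the paper):

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    # Group-relative advantage: normalize each sample's reward by the
    # group's mean and std, removing the need for a learned value critic.
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def clipped_objective(ratio, adv, clip=0.2):
    # PPO-style clipped surrogate applied at each token step;
    # ratio = pi_new(token | prefix) / pi_old(token | prefix)
    return np.minimum(ratio * adv, np.clip(ratio, 1.0 - clip, 1.0 + clip) * adv)
```

Because the advantage is relative within the group, the same rollout batch can absorb heterogeneous reward scales (LOO-FID, CLIP, HPSv2) without per-reward tuning of a value baseline.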
Original Abstract
Autoregressive (AR) models are highly effective for image generation, yet their standard maximum-likelihood estimation training lacks direct optimization for sample quality and diversity. While reinforcement learning (RL) has been used to align diffusion models, these methods typically suffer from output diversity collapse. Similarly, concurrent RL methods for AR models rely strictly on instance-level rewards, often trading off distributional coverage for quality. To address these limitations, we propose a lightweight RL framework that casts token-based AR synthesis as a Markov Decision Process, optimized via Group Relative Policy Optimization (GRPO). Our core contribution is the introduction of a novel distribution-level Leave-One-Out FID (LOO-FID) reward; by leveraging an exponential moving average of feature moments, it explicitly encourages sample diversity and prevents mode collapse during policy updates. We integrate this with composite instance-level rewards (CLIP and HPSv2) for strict semantic and perceptual fidelity, and stabilize the multi-objective learning with an adaptive entropy regularization term. Extensive experiments on LlamaGen and VQGAN architectures demonstrate clear improvements across standard quality and diversity metrics within only a few hundred tuning iterations. The results also show that the model can be updated to produce competitive samples even without Classifier-Free Guidance, and bypass its 2x inference cost.
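The adaptive entropy regularization mentioned in the abstract can be read as keeping the policy's next-token entropy near a target, raising the entropy-bonus weight when the policy starts collapsing and lowering it otherwise. The scheme below is a common stabilizer of that kind and is my assumption, not the paper's exact update rule; the target, initial coefficient, and learning rate are illustrative.

```python
import numpy as np

def token_entropy(probs, eps=1e-12):
    # Shannon entropy of a next-token distribution
    p = np.asarray(probs, dtype=float)
    return float(-np.sum(p * np.log(p + eps)))

class AdaptiveEntropyCoef:
    # Drives the entropy-bonus weight toward a target entropy:
    # coefficient grows when observed entropy drops below target
    # (policy collapsing) and shrinks when entropy exceeds it.
    def __init__(self, target_entropy, coef=0.01, lr=1e-3):
        self.target = target_entropy
        self.coef = coef
        self.lr = lr

    def update(self, observed_entropy):
        self.coef = max(0.0, self.coef
                        + self.lr * (self.target - observed_entropy))
        return self.coef
```

Tying the coefficient to a measured entropy, rather than fixing it, is what lets one schedule serve both the diversity-seeking LOO-FID term and the fidelity-seeking instance rewards without retuning.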