Multimodal Learning Relevance: 9/10

Not All Tokens See Equally: Perception-Grounded Policy Optimization for Large Vision-Language Models

Zekai Ye, Qiming Li, Xiaocheng Feng, Ruihan Chen, Ziming Li, Haoyu Ren, Kun Chen, Dandan Tu, Bing Qin
arXiv: 2604.01840v1 Published: 2026-04-02 Updated: 2026-04-02

AI Summary

Proposes Perception-Grounded Policy Optimization (PGPO), which improves LVLM performance on multimodal reasoning tasks by dynamically reshaping the token-level advantage function.

Key Contributions

  • Formulates Token Visual Dependency, a measure quantifying the information gain contributed by visual inputs.
  • Introduces Perception-Grounded Policy Optimization (PGPO), which dynamically reshapes the token-level advantage function.
  • Demonstrates experimentally that PGPO yields significant LVLM gains across multiple multimodal reasoning benchmarks.

Methodology

Each token's visual dependency is quantified as the KL divergence between the model's visual-conditioned and text-only predictive distributions. A threshold-gated mechanism then reshapes the advantage function, amplifying the learning signal for visually dependent tokens.
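The two steps above can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the function names, the threshold `tau`, the scaling factor `alpha`, and the renormalization used to keep the advantage mass conserved are all assumptions for demonstration.

```python
import numpy as np

def token_visual_dependency(p_visual, p_text):
    """Per-token KL divergence D_KL(p_visual || p_text) between the
    visual-conditioned and text-only next-token distributions.
    Both inputs have shape (num_tokens, vocab_size); rows sum to 1."""
    eps = 1e-12  # numerical guard against log(0)
    return np.sum(p_visual * (np.log(p_visual + eps) - np.log(p_text + eps)), axis=-1)

def reshape_advantages(advantage, dependency, tau=0.05, alpha=1.0):
    """Threshold-gated, mass-conserving advantage reshaping (illustrative).

    Tokens whose visual dependency exceeds tau are up-weighted in
    proportion to that dependency; weights are then renormalized so the
    total advantage mass is unchanged. tau and alpha are hypothetical
    hyperparameters, not values from the paper."""
    gate = (dependency > tau).astype(float)      # threshold gating
    weights = 1.0 + alpha * gate * dependency    # amplify visually-dependent tokens
    weights *= len(weights) / weights.sum()      # conserve total weight mass
    return advantage * weights
```

With a uniform per-token advantage (as in GRPO-style RLVR), the reshaping redistributes credit toward visually grounded tokens while leaving the sequence-level total unchanged.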

Original Abstract

While Reinforcement Learning from Verifiable Rewards (RLVR) has advanced reasoning in Large Vision-Language Models (LVLMs), prevailing frameworks suffer from a foundational methodological flaw: by distributing identical advantages across all generated tokens, these methods inherently dilute the learning signals essential for optimizing the critical, visually-grounded steps of multimodal reasoning. To bridge this gap, we formulate Token Visual Dependency, quantifying the causal information gain of visual inputs via the Kullback-Leibler (KL) divergence between visual-conditioned and text-only predictive distributions. Revealing that this dependency is highly sparse and semantically pivotal, we introduce Perception-Grounded Policy Optimization (PGPO), a novel fine-grained credit assignment framework that dynamically reshapes advantages at the token level. Through a threshold-gated, mass-conserving mechanism, PGPO actively amplifies learning signals for visually-dependent tokens while suppressing gradient noise from linguistic priors. Extensive experiments based on the Qwen2.5-VL series across seven challenging multimodal reasoning benchmarks demonstrate that PGPO boosts models by 18.7% on average. Both theoretical and empirical analyses confirm that PGPO effectively reduces gradient variance, prevents training collapse, and acts as a potent regularizer for robust, perception-grounded multimodal reasoning. Code will be published on https://github.com/Yzk1114/PGPO.

Tags

Multimodal Learning · Vision-Language Models · Reinforcement Learning · Policy Optimization

arXiv Category

cs.AI