Multimodal Learning 相关度: 9/10

Enhancing Multi-Modal LLMs Reasoning via Difficulty-Aware Group Normalization

Jinghan Li, Junfeng Fang, Jinda Lu, Yuan Wang, Xiaoyan Guo, Tianyu Zhang, Xiang Wang, Xiangnan He

arXiv: 2602.21743v1 发布: 2026-02-25 更新: 2026-02-25

下载 PDF arXiv 页面

AI 摘要

提出一种难度感知的分组归一化方法Durian，提升多模态LLM的推理能力。

主要贡献

提出了难度感知的分组归一化方法Durian
通过视觉熵和模型置信度来定义样本难度
解决了多模态LLM中std-based归一化的不稳定性问题

方法论

通过视觉熵和模型置信度定义样本难度，将样本按难度分组，在组内共享std进行归一化，缓解极端样本的影响。

原文摘要

Reinforcement Learning with Verifiable Rewards (RLVR) and Group Relative Policy Optimization (GRPO) have significantly advanced the reasoning capabilities of large language models. Extending these methods to multimodal settings, however, faces a critical challenge: the instability of std-based normalization, which is easily distorted by extreme samples with nearly positive or negative rewards. Unlike pure-text LLMs, multimodal models are particularly sensitive to such distortions, as both perceptual and reasoning errors influence their responses. To address this, we characterize each sample by its difficulty, defined through perceptual complexity (measured via visual entropy) and reasoning uncertainty (captured by model confidence). Building on this characterization, we propose difficulty-aware group normalization (Durian), which re-groups samples by difficulty levels and shares the std within each group. Our approach preserves GRPO's intra-group distinctions while eliminating sensitivity to extreme cases, yielding significant performance gains across multiple multimodal reasoning benchmarks.

arXiv 分类

cs.CV

AI 摘要

主要贡献

方法论

原文摘要

标签

arXiv 分类