Selective Training for Large Vision Language Models via Visual Information Gain
AI Summary
The paper proposes a selective training method based on Visual Information Gain that strengthens the visual grounding of LVLMs and mitigates language bias.
Main Contributions
- Introduces Visual Information Gain (VIG), a metric that measures the reduction in prediction uncertainty contributed by the visual input
- Proposes a VIG-guided selective training scheme that prioritizes high-VIG samples and tokens
- Shows that this selective training significantly reduces supervision cost while improving visual grounding performance
Methodology
Compute Visual Information Gain (VIG) to quantify how much the visual input influences each prediction, then use VIG to select highly informative samples and tokens for training.
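Since the abstract describes VIG as a perplexity-based measure of uncertainty reduction, a natural reading is the difference in surprisal of the ground-truth tokens with and without the image. The sketch below is an illustrative assumption, not the paper's exact formula; `token_vig` and `sample_vig` are hypothetical names, and the probability lists stand in for a model's next-token probabilities under the two conditions.

```python
import math

def token_vig(p_with, p_without):
    """Per-token Visual Information Gain: surprisal of the ground-truth token
    without the image minus surprisal with the image.
    Positive values mean the image reduced uncertainty for that token."""
    return [math.log(pw) - math.log(po) for pw, po in zip(p_with, p_without)]

def sample_vig(p_with, p_without):
    """Sample-level VIG: mean per-token gain. Under this definition it equals
    log(PPL_without_image / PPL_with_image), i.e. the log perplexity ratio."""
    gains = token_vig(p_with, p_without)
    return sum(gains) / len(gains)

# Hypothetical probabilities for the answer tokens "a red ball on the table":
p_img = [0.6, 0.7, 0.5, 0.9, 0.8, 0.6]   # conditioned on image + text
p_txt = [0.6, 0.1, 0.4, 0.9, 0.8, 0.2]   # text-only (language prior)

gains = token_vig(p_img, p_txt)
# Visually grounded tokens ("red", "table") get large gains; function
# words the language prior already predicts well get near-zero gains.
```

Under this formulation, token-level VIG highlights exactly the visually grounded elements the abstract mentions (colors, spatial relations, attributes), since those are the tokens a text-only prior cannot predict.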
原文摘要
Large Vision Language Models (LVLMs) have achieved remarkable progress, yet they often suffer from language bias, producing answers without relying on visual evidence. While prior work attempts to mitigate this issue through decoding strategies, architectural modifications, or curated instruction data, they typically lack a quantitative measure of how much individual training samples or tokens actually benefit from the image. In this work, we introduce Visual Information Gain (VIG), a perplexity-based metric that measures the reduction in prediction uncertainty provided by visual input. VIG enables fine-grained analysis at both sample and token levels, effectively highlighting visually grounded elements such as colors, spatial relations, and attributes. Leveraging this, we propose a VIG-guided selective training scheme that prioritizes high-VIG samples and tokens. This approach improves visual grounding and mitigates language bias, achieving superior performance with significantly reduced supervision by focusing exclusively on visually informative samples and tokens.
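The VIG-guided selective training described above can be sketched as a simple loss mask over tokens. This is a minimal illustration, assuming top-fraction selection by VIG score; `keep_ratio` is a hypothetical hyperparameter and the paper's actual selection rule may differ.

```python
def vig_token_mask(gains, keep_ratio=0.5):
    """Binary loss mask keeping the top-`keep_ratio` fraction of tokens by VIG.
    Masked-out (low-VIG) tokens contribute no gradient, so training focuses
    on visually informative tokens. Ties at the threshold are all kept."""
    k = max(1, int(len(gains) * keep_ratio))
    threshold = sorted(gains, reverse=True)[k - 1]
    return [1 if g >= threshold else 0 for g in gains]

# Example: per-token VIG scores for one training sample.
mask = vig_token_mask([2.1, 0.0, 1.3, -0.2], keep_ratio=0.5)
# The two highest-VIG tokens are kept; the language-prior tokens are masked.
```

The same thresholding idea applies at the sample level: rank training samples by `sample_vig` and keep only the high-VIG subset, which is how the reduced-supervision claim in the abstract would be realized.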