Multimodal Learning — Relevance: 10/10

A Comprehensive Information-Decomposition Analysis of Large Vision-Language Models

Lixin Xiu, Xufang Luo, Hideki Nakayama
arXiv: 2603.29676v1 — Published: 2026-03-31, Updated: 2026-03-31

AI Summary

This paper uses an information-decomposition method to analyze the decision-making process of LVLMs, revealing the extent of their multimodal fusion versus their reliance on unimodal priors.

Key Contributions

  • Proposes a novel framework based on partial information decomposition (PID) to quantitatively measure the "information spectrum" of LVLMs.
  • Reveals two task regimes (synergy-driven vs. knowledge-driven) and two model-family strategies (fusion-centric vs. language-centric).
  • Uncovers a consistent three-phase pattern in layer-wise processing and identifies visual instruction tuning as the key stage where fusion is learned.

Methodology

Using a scalable PID estimator, the analysis profiles 26 LVLMs on four datasets along three dimensions: breadth (cross-model and cross-task), depth (layer-wise information dynamics), and time (learning dynamics across training).

Original Abstract

Large vision-language models (LVLMs) achieve impressive performance, yet their internal decision-making processes remain opaque, making it difficult to determine if the success stems from true multimodal fusion or from reliance on unimodal priors. To address this attribution gap, we introduce a novel framework using partial information decomposition (PID) to quantitatively measure the "information spectrum" of LVLMs -- decomposing a model's decision-relevant information into redundant, unique, and synergistic components. By adapting a scalable estimator to modern LVLM outputs, our model-agnostic pipeline profiles 26 LVLMs on four datasets across three dimensions -- breadth (cross-model & cross-task), depth (layer-wise information dynamics), and time (learning dynamics across training). Our analysis reveals two key results: (i) two task regimes (synergy-driven vs. knowledge-driven) and (ii) two stable, contrasting family-level strategies (fusion-centric vs. language-centric). We also uncover a consistent three-phase pattern in layer-wise processing and identify visual instruction tuning as the key stage where fusion is learned. Together, these contributions provide a quantitative lens beyond accuracy-only evaluation and offer insights for analyzing and designing the next generation of LVLMs. Code and data are available at https://github.com/RiiShin/pid-lvlm-analysis .
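To make the "redundant, unique, and synergistic" decomposition concrete, here is a minimal sketch of PID for two discrete sources and one target using the classic Williams-Beer redundancy measure (I_min). This is an illustrative toy implementation, not the paper's scalable estimator; the function name `pid_williams_beer` and the XOR example are this note's own, chosen because XOR is the canonical case where all decision-relevant information is synergistic.

```python
from collections import defaultdict
from math import log2

def pid_williams_beer(joint):
    """Williams-Beer PID for two sources X1, X2 and one target Y.
    `joint` maps (x1, x2, y) -> probability (must sum to 1)."""
    # Accumulate the marginals needed for the mutual-information terms.
    p_y = defaultdict(float)
    p_x1 = defaultdict(float); p_x2 = defaultdict(float)
    p_x1y = defaultdict(float); p_x2y = defaultdict(float)
    p_x12 = defaultdict(float)
    for (x1, x2, y), p in joint.items():
        p_y[y] += p; p_x1[x1] += p; p_x2[x2] += p
        p_x1y[(x1, y)] += p; p_x2y[(x2, y)] += p; p_x12[(x1, x2)] += p

    def mi(pair, px):
        # I(X; Y) from the pairwise joint p(x, y).
        return sum(p * log2(p / (px[x] * p_y[y]))
                   for (x, y), p in pair.items() if p > 0)

    def specific(y, pair, px):
        # Specific information I(Y=y; X): how much the source tells us
        # about this particular target outcome.
        total = 0.0
        for (x, yy), p in pair.items():
            if yy != y or p == 0:
                continue
            p_x_given_y = p / p_y[y]
            p_y_given_x = p / px[x]
            total += p_x_given_y * log2(p_y_given_x / p_y[y])
        return total

    i1, i2 = mi(p_x1y, p_x1), mi(p_x2y, p_x2)
    # Joint mutual information I(Y; X1, X2).
    i12 = sum(p * log2(p / (p_x12[(x1, x2)] * p_y[y]))
              for (x1, x2, y), p in joint.items() if p > 0)
    # Redundancy = expected minimum specific information over sources.
    redundancy = sum(p_y[y] * min(specific(y, p_x1y, p_x1),
                                  specific(y, p_x2y, p_x2))
                     for y in p_y)
    unique1, unique2 = i1 - redundancy, i2 - redundancy
    synergy = i12 - unique1 - unique2 - redundancy
    return {"redundant": redundancy, "unique1": unique1,
            "unique2": unique2, "synergistic": synergy}

# XOR target with independent uniform inputs: I(Y;X1)=I(Y;X2)=0,
# yet I(Y;X1,X2)=1 bit -- all of it synergistic.
xor = {(x1, x2, x1 ^ x2): 0.25 for x1 in (0, 1) for x2 in (0, 1)}
result = pid_williams_beer(xor)
print(result)  # synergistic = 1.0 bit; redundant and unique terms = 0
```

In the paper's setting the "target" is the model's answer distribution and the "sources" are the visual and textual inputs, so a fusion-centric model would show a large synergistic component while a language-centric one would concentrate information in the text source's unique term.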

Tags

LVLM · Partial Information Decomposition · Multimodal Fusion · Interpretability

arXiv Categories

cs.LG cs.CL cs.CV