Multimodal Learning · Relevance: 10/10

How Vision Becomes Language: A Layer-wise Information-Theoretic Analysis of Multimodal Reasoning

Hongxuan Wu, Yukun Zhang, Xueqing Zhou
arXiv: 2602.15580v1 · Published: 2026-02-17 · Updated: 2026-02-17

AI Summary

The paper uses information-theoretic methods to analyze how visual information is transformed into language inside multimodal Transformers.

Key Contributions

  • Proposes PID Flow, a Partial Information Decomposition (PID) framework that scales to high-dimensional neural representations
  • Reveals a modal transduction pattern in multimodal Transformers: vision-unique information dominates early layers, language-unique information dominates late layers
  • Validates the causal role of the modal transduction pathway through intervention experiments

Methodology

A layer-wise analysis framework based on Partial Information Decomposition (PID), with PID Flow handling high-dimensional data and intervention experiments verifying causal relationships.
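The paper's exact PID Flow pipeline is not reproduced here, but its final step — closed-form PID for Gaussianized representations — can be illustrated with a common choice, the minimum-mutual-information (MMI) redundancy definition (Barrett, 2015), which admits a closed form for jointly Gaussian variables. This is a sketch under that assumption; the function names, the toy data, and the use of MMI (rather than whatever redundancy measure the authors adopt) are illustrative choices, not the paper's implementation.

```python
import numpy as np

def gaussian_mi(cov, ix, iy):
    """I(X; Y) in nats for jointly Gaussian variables, from the joint
    covariance matrix: 0.5 * log(det(Cx) * det(Cy) / det(Cxy))."""
    _, ld_x = np.linalg.slogdet(cov[np.ix_(ix, ix)])
    _, ld_y = np.linalg.slogdet(cov[np.ix_(iy, iy)])
    _, ld_xy = np.linalg.slogdet(cov[np.ix_(ix + iy, ix + iy)])
    return 0.5 * (ld_x + ld_y - ld_xy)

def mmi_pid(cov, iv, il, iy):
    """MMI-style Gaussian PID of the information that sources V (vision)
    and L (language) carry about target Y. Returns the four atoms
    (redundant, vision-unique, language-unique, synergistic) in nats."""
    mi_v = gaussian_mi(cov, iv, iy)
    mi_l = gaussian_mi(cov, il, iy)
    mi_joint = gaussian_mi(cov, iv + il, iy)
    red = min(mi_v, mi_l)            # MMI redundancy: smaller single-source MI
    uniq_v = mi_v - red              # unique terms follow from the PID lattice
    uniq_l = mi_l - red
    syn = mi_joint - uniq_v - uniq_l - red
    return red, uniq_v, uniq_l, syn

# Toy check: Y = V + L + small noise, V and L independent. Neither source
# alone pins Y down, so synergy should dominate and unique terms vanish.
rng = np.random.default_rng(0)
v = rng.normal(size=100_000)
l = rng.normal(size=100_000)
y = v + l + 0.1 * rng.normal(size=100_000)
cov = np.cov(np.stack([v, l, y]))
red, uniq_v, uniq_l, syn = mmi_pid(cov, [0], [1], [2])
```

In the paper's setting, V and L would be dimensionality-reduced, flow-Gaussianized hidden states at a given layer and Y the model's prediction; repeating the decomposition per layer yields the information trajectories described above.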

Original Abstract

When a multimodal Transformer answers a visual question, is the prediction driven by visual evidence, linguistic reasoning, or genuinely fused cross-modal computation -- and how does this structure evolve across layers? We address this question with a layer-wise framework based on Partial Information Decomposition (PID) that decomposes the predictive information at each Transformer layer into redundant, vision-unique, language-unique, and synergistic components. To make PID tractable for high-dimensional neural representations, we introduce PID Flow, a pipeline combining dimensionality reduction, normalizing-flow Gaussianization, and closed-form Gaussian PID estimation. Applying this framework to LLaVA-1.5-7B and LLaVA-1.6-7B across six GQA reasoning tasks, we uncover a consistent modal transduction pattern: visual-unique information peaks early and decays with depth, language-unique information surges in late layers to account for roughly 82% of the final prediction, and cross-modal synergy remains below 2%. This trajectory is highly stable across model variants (layer-wise correlations >0.96) yet strongly task-dependent, with semantic redundancy governing the detailed information fingerprint. To establish causality, we perform targeted Image→Question attention knockouts and show that disrupting the primary transduction pathway induces predictable increases in trapped visual-unique information, compensatory synergy, and total information cost -- effects that are strongest in vision-dependent tasks and weakest in high-redundancy tasks. Together, these results provide an information-theoretic, causal account of how vision becomes language in multimodal Transformers, and offer quantitative guidance for identifying architectural bottlenecks where modality-specific information is lost.

Tags

Multimodal Learning · Information Theory · Transformer · Visual Question Answering · Interpretability

arXiv Category

cs.AI