Multimodal Learning (Relevance: 9/10)

Nodes Are Early, Edges Are Late: Probing Diagram Representations in Large Vision-Language Models

Haruto Yoshida, Keito Kudo, Yoichi Aoki, Ryota Tanaka, Itsumi Saito, Keisuke Sakaguchi, Kentaro Inui
arXiv: 2603.02865v1 Published: 2026-03-03 Updated: 2026-03-03

AI Summary

Through probing experiments, this paper reveals a stage-wise difference in how LVLMs encode node and edge information when processing diagrams of graph-structured data.

Key Contributions

  • Finds that in LVLMs, node information is encoded early in the vision encoder, whereas edge information is encoded only later.
  • Shows that the linear separability of edge information differs between the vision encoder and the language model.
  • Proposes that the delayed emergence of edge representations may explain LVLMs' weakness in relational understanding.

Methodology

Probing experiments are run on the internal representations of LVLMs, using a carefully constructed synthetic diagram dataset, to analyze the stage at which node and edge information become linearly encoded.
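The core tool here is the linear probe: a linear classifier trained on frozen hidden states to test whether a property (e.g. "does an edge from A to B exist?") is linearly decodable at a given layer. The sketch below illustrates the general technique only; the data, dimensions, and probe hyperparameters are invented for illustration and do not reproduce the paper's actual setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_linear_probe(X, y, lr=0.1, steps=500):
    """Logistic-regression probe trained by plain gradient descent.

    X: (n, d) frozen hidden states; y: (n,) binary labels for the
    property being probed (e.g. presence of a target edge).
    """
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid predictions
        w -= lr * (X.T @ (p - y)) / len(y)      # gradient of log-loss
        b -= lr * np.mean(p - y)
    return w, b

def probe_accuracy(w, b, X, y):
    return float(np.mean(((X @ w + b) > 0) == y))

# Stand-in "hidden states": two classes shifted along one direction,
# mimicking activations with vs. without the probed property being
# linearly encoded. Real experiments would use actual LVLM activations.
d = 16
direction = rng.normal(size=d)
X0 = rng.normal(size=(200, d)) - direction
X1 = rng.normal(size=(200, d)) + direction
X = np.vstack([X0, X1])
y = np.concatenate([np.zeros(200), np.ones(200)])

w, b = fit_linear_probe(X, y)
acc = probe_accuracy(w, b, X, y)
```

High probe accuracy at a layer is evidence the property is linearly separable there; near-chance accuracy (as the paper reports for edges in the vision encoder) suggests it is not.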

Original Abstract

Large vision-language models (LVLMs) demonstrate strong performance on diagram understanding benchmarks, yet they still struggle with understanding relationships between elements, particularly those represented by nodes and directed edges (e.g., arrows and lines). To investigate the underlying causes of this limitation, we probe the internal representation of LVLMs using a carefully constructed synthetic diagram dataset based on directed graphs. Our probing experiments reveal that edge information is not linearly separable in the vision encoder and becomes linearly encoded only in the text tokens in the language model. In contrast, node information and global structural features are already linearly encoded in individual hidden states of the vision encoder. These findings suggest that the stage at which linearly separable representations are formed varies depending on the type of visual information. In particular, the delayed emergence of edge representations may help explain why LVLMs struggle with relational understanding, such as interpreting edge directions, which require more abstract, compositionally integrated processes.

Tags

vision-language models · diagram understanding · probing · relational reasoning · directed graphs

arXiv Categories

cs.CL cs.CV