Multimodal Learning Relevance: 10/10

Where Vision Becomes Text: Locating the OCR Routing Bottleneck in Vision-Language Models

Jonathan Steinberg, Oren Gal
arXiv: 2602.22918v1 Published: 2026-02-26 Updated: 2026-02-26

AI Summary

This paper uses causal interventions to probe how OCR information is routed in vision-language models and to locate the bottleneck where it enters the language stream.

Key Contributions

  • Reveals that the location of the OCR bottleneck differs across VLM architectures
  • Shows that the OCR signal is low-dimensional and transfers across datasets
  • Finds that removing OCR can improve counting performance in some cases

Methodology

Using causal interventions, the authors compute activation differences between original images and their text-inpainted versions, then analyze the resulting OCR signal with PCA.
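The difference-then-PCA step can be sketched as follows. This is a toy illustration, not the paper's code: the activation matrices are synthetic stand-ins for hooked hidden states, and all names (`act_original`, `act_inpainted`, `ocr_direction`) are assumptions for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pairs, d_model = 200, 64           # toy values: image pairs, hidden size

# Stand-ins for one layer's activations; a real run would hook the VLM.
act_original = rng.normal(size=(n_pairs, d_model))
ocr_direction = rng.normal(size=d_model)
ocr_direction /= np.linalg.norm(ocr_direction)
# Simulate inpainting: text removal deletes a mostly rank-1 OCR component.
act_inpainted = (act_original
                 - 3.0 * rng.normal(size=(n_pairs, 1)) * ocr_direction
                 + 0.2 * rng.normal(size=(n_pairs, d_model)))

diff = act_original - act_inpainted              # per-pair OCR effect
diff_centered = diff - diff.mean(axis=0)
# PCA via SVD: rows of vt are principal directions of the OCR signal.
_, s, vt = np.linalg.svd(diff_centered, full_matrices=False)
explained = s**2 / np.sum(s**2)

print(f"PC1 explained variance: {explained[0]:.2f}")
print(f"PC1 alignment with planted OCR direction: "
      f"{abs(vt[0] @ ocr_direction):.2f}")
```

On this synthetic data PC1 dominates and aligns with the planted direction, mirroring the paper's finding that the real OCR signal is low-dimensional (PC1 at 72.9% of variance).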

Original Abstract

Vision-language models (VLMs) can read text from images, but where does this optical character recognition (OCR) information enter the language processing stream? We investigate the OCR routing mechanism across three architecture families (Qwen3-VL, Phi-4, InternVL3.5) using causal interventions. By computing activation differences between original images and text-inpainted versions, we identify architecture-specific OCR bottlenecks whose dominant location depends on the vision-language integration strategy: DeepStack models (Qwen) show peak sensitivity at mid-depth (about 50%) for scene text, while single-stage projection models (Phi-4, InternVL) peak at early layers (6-25%), though the exact layer of maximum effect varies across datasets. The OCR signal is remarkably low-dimensional: PC1 captures 72.9% of variance. Crucially, principal component analysis (PCA) directions learned on one dataset transfer to others, demonstrating shared text-processing pathways. Surprisingly, in models with modular OCR circuits (notably Qwen3-VL-4B), OCR removal can improve counting performance (up to +6.9 percentage points), suggesting OCR interferes with other visual processing in sufficiently modular architectures.
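The OCR-removal intervention described in the abstract amounts to projecting the dominant PCA direction out of a layer's activations. A minimal linear-algebra sketch, with a toy activation matrix and an assumed-known unit direction `pc1` standing in for the learned component (the paper applies this inside the model's forward pass):

```python
import numpy as np

rng = np.random.default_rng(1)
d_model = 64
acts = rng.normal(size=(10, d_model))    # toy hidden states
pc1 = rng.normal(size=d_model)
pc1 /= np.linalg.norm(pc1)               # unit "OCR" direction (assumed known)

# Ablate the OCR component: h <- h - (h . pc1) pc1
ablated = acts - np.outer(acts @ pc1, pc1)

# Edited activations carry no signal along pc1 (zero up to float error).
print(np.abs(ablated @ pc1).max())
```

Because the edit is a rank-1 projection, everything orthogonal to `pc1` passes through unchanged, which is why removal can help other tasks (like counting) when OCR occupies its own modular direction.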

Tags

VLM OCR Causal Intervention Vision-Language Models

arXiv Category

cs.CL