Reading $\neq$ Seeing: Diagnosing and Closing the Typography Gap in Vision-Language Models
AI Summary
VLMs perform near-perfectly at text recognition, yet a clear gap remains in typographic perception; the paper systematically studies this gap and proposes improvements.
Key Contributions
- Identified a typographic-perception gap in VLMs, most pronounced for font style
- Built an evaluation framework and dataset for measuring VLMs' typographic abilities
- Improved an open-source model's font-recognition performance via LoRA fine-tuning
Methodology
Existing VLMs are evaluated on a suite of font-related recognition tasks (family, size, style, and color), and an open-source model is then improved via LoRA fine-tuning.
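The evaluation grid implied above (four typographic attributes × 26 fonts × four scripts × three difficulty levels) can be sketched with a few lines of stdlib Python. The concrete script names, difficulty labels, and field names below are illustrative placeholders, not the paper's exact protocol:

```python
from itertools import product

# Hypothetical evaluation grid; the paper's actual font list, scripts,
# and task phrasing may differ.
ATTRIBUTES = ["family", "size", "style", "color"]   # four recognition tasks
SCRIPTS = ["Latin", "Cyrillic", "Arabic", "CJK"]    # assumed script set
DIFFICULTIES = ["easy", "medium", "hard"]           # three difficulty levels
N_FONTS = 26                                        # fonts, per the abstract

def build_eval_grid():
    """Enumerate one query condition per (task, font, script, difficulty)."""
    return [
        {"task": attr, "font_id": f, "script": s, "difficulty": d}
        for attr, f, s, d in product(ATTRIBUTES, range(N_FONTS),
                                     SCRIPTS, DIFFICULTIES)
    ]

grid = build_eval_grid()
print(len(grid))  # 4 * 26 * 4 * 3 = 1248 query conditions
```

Each grid entry would then be rendered as a synthetic image and posed to a VLM as a multiple-choice or free-form question about the attribute in question.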
Original Abstract
Vision-Language Models achieve near-perfect accuracy at reading text in images, yet prove largely typography-blind: capable of recognizing what text says, but not how it looks. We systematically investigate this gap by evaluating font family, size, style, and color recognition across 26 fonts, four scripts, and three difficulty levels. Our evaluation of 15 state-of-the-art VLMs reveals a striking perception hierarchy: color recognition is near-perfect, yet font style detection remains universally poor. We further find that model scale fails to predict performance and that accuracy is uniform across difficulty levels, together pointing to a training-data omission rather than a capacity ceiling. LoRA fine-tuning on a small set of synthetic samples substantially improves an open-source model, narrowing the gap to the best closed-source system and surpassing it on font size recognition. Font style alone remains resistant to fine-tuning, suggesting that relational visual reasoning may require architectural innovation beyond current patch-based encoders. We release our evaluation framework, data, and fine-tuning recipe to support progress in closing the typographic gap in vision-language understanding.
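For readers unfamiliar with the fine-tuning method the abstract mentions: LoRA (Hu et al., 2021) freezes the pretrained weight matrix $W$ and learns only a low-rank additive update, which is what makes adaptation on a small synthetic sample set cheap. A minimal statement, with rank $r$ and scaling factor $\alpha$ chosen by the practitioner:

$$W' = W + \frac{\alpha}{r} BA, \qquad B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k},\; r \ll \min(d, k)$$

Only $A$ and $B$ receive gradients; at inference the product $BA$ can be merged back into $W$, so the adapted model incurs no extra latency.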