Multimodal Learning Relevance: 8/10

Error Patterns in Historical OCR: A Comparative Analysis of TrOCR and a Vision-Language Model

Ari Vesalainen, Eetu Mäkelä, Laura Ruotsalainen, Mikko Tolonen

arXiv: 2602.14524v1 Published: 2026-02-16 Updated: 2026-02-16

AI Summary

Compares the error patterns of TrOCR and Qwen on historical-text OCR and analyzes the implications for scholarly research.

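Main Contributions

Shows how TrOCR and Qwen differ in their OCR error patterns on historical texts.
Proposes a hypothesis-driven method for error analysis.
Emphasizes the importance of architecture-aware evaluation in historical digitization workflows.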

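Methodology

Compares TrOCR and Qwen on line-level historical English texts (eighteenth-century print), using length-weighted accuracy metrics and hypothesis-driven error analysis.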
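
This summary does not spell out how the length-weighted metrics are defined. The sketch below shows one common reading, offered as an assumption rather than the authors' formulation: edit distances are pooled over all lines and divided by the total reference length, so long lines count proportionally more than short ones. The function names `edit_distance` and `length_weighted_error_rate` and the sample lines are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def length_weighted_error_rate(pairs, tokenize=list):
    """Pooled error rate: total edits divided by total reference length.

    `pairs` is an iterable of (reference, hypothesis) strings.
    `tokenize=list` gives a character-level rate (CER);
    `tokenize=str.split` gives a word-level rate (WER).
    """
    total_edits = 0
    total_length = 0
    for ref, hyp in pairs:
        ref_tokens, hyp_tokens = tokenize(ref), tokenize(hyp)
        total_edits += edit_distance(ref_tokens, hyp_tokens)
        total_length += len(ref_tokens)
    if total_length == 0:
        raise ValueError("reference text is empty")
    return total_edits / total_length


# Hypothetical sample: the first hypothesis modernizes the long s.
lines = [("the ſame", "the same"), ("a", "b")]
cer = length_weighted_error_rate(lines)                      # 2 edits / 9 characters
wer = length_weighted_error_rate(lines, tokenize=str.split)  # 2 edits / 3 words
```

With these two lines, the unweighted mean of per-line CER would be (1/8 + 1) / 2 = 0.5625, whereas the pooled value is 2/9, which illustrates how a single short, badly recognized line can dominate an unweighted average. Note also that the normalization of "ſame" to "same" costs only one character edit, so an aggregate score alone does not reveal it.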

Original Abstract

Optical Character Recognition (OCR) of eighteenth-century printed texts remains challenging due to degraded print quality, archaic glyphs, and non-standardized orthography. Although transformer-based OCR systems and Vision-Language Models (VLMs) achieve strong aggregate accuracy, metrics such as Character Error Rate (CER) and Word Error Rate (WER) provide limited insight into their reliability for scholarly use. We compare a dedicated OCR transformer (TrOCR) and a general-purpose Vision-Language Model (Qwen) on line-level historical English texts using length-weighted accuracy metrics and hypothesis driven error analysis. While Qwen achieves lower CER/WER and greater robustness to degraded input, it exhibits selective linguistic regularization and orthographic normalization that may silently alter historically meaningful forms. TrOCR preserves orthographic fidelity more consistently but is more prone to cascading error propagation. Our findings show that architectural inductive biases shape OCR error structure in systematic ways. Models with similar aggregate accuracy can differ substantially in error locality, detectability, and downstream scholarly risk, underscoring the need for architecture-aware evaluation in historical digitization workflows.

arXiv Categories

cs.CV
