LLM Memory & RAG Relevance: 6/10

OCRTurk: A Comprehensive OCR Benchmark for Turkish

Deniz Yılmaz, Evren Ayberk Munis, Çağrı Toraman, Süha Kağan Köse, Burak Aktaş, Mehmet Can Baytekin, Bilge Kaan Görür
arXiv: 2602.03693v1 Published: 2026-02-03 Updated: 2026-02-03

AI Summary

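OCRTurk is a Turkish document parsing benchmark that spans multiple document types and difficulty levels; the paper uses it to evaluate seven OCR models.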

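Main Contributions

  • Introduces OCRTurk, a benchmark for Turkish document parsing
  • Covers multiple document types and layout elements
  • Evaluates the performance of seven OCR models on OCRTurk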

Methodology

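The authors build a dataset of 180 Turkish documents and evaluate seven OCR models on it with element-wise metrics, analyzing how performance varies across document types and difficulty levels.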
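
The abstract names Normalized Edit Distance as one of the reported scores but does not give its exact definition. The sketch below is a minimal illustration under a common assumption: character-level Levenshtein distance divided by the length of the longer string, so that 0.0 means identical text and 1.0 means nothing matches. The function names are hypothetical and this is not the paper's evaluation code. Some benchmarks report one minus this value as a similarity, which may be what "high" scores refers to in the abstract.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    # Keep the shorter string in the inner loop to use less memory.
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def normalized_edit_distance(pred: str, ref: str) -> float:
    """Edit distance scaled to [0, 1] by the longer string's length (assumed normalization)."""
    longest = max(len(pred), len(ref))
    if longest == 0:
        return 0.0  # two empty strings are identical
    return levenshtein(pred, ref) / longest


# Hypothetical example: an OCR system that drops the dot on the Turkish capital I.
reference = "\u0130stanbul"   # "İstanbul", with dotted capital I (U+0130)
prediction = "Istanbul"
ned = normalized_edit_distance(prediction, reference)
similarity = 1.0 - ned
```

Turkish-specific characters such as İ, ı, ş, and ğ each count as a single substitution when misrecognized, provided both strings use the same Unicode normalization form; comparing precomposed and decomposed forms without normalizing first would inflate the distance.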

Original Abstract

Document parsing is now widely used in applications, such as large-scale document digitization, retrieval-augmented generation, and domain-specific pipelines in healthcare and education. Benchmarking these models is crucial for assessing their reliability and practical robustness. Existing benchmarks mostly target high-resource languages and provide limited coverage for low-resource settings, such as Turkish. Moreover, existing studies on Turkish document parsing lack a standardized benchmark that reflects real-world scenarios and document diversity. To address this gap, we introduce OCRTurk, a Turkish document parsing benchmark covering multiple layout elements and document categories at three difficulty levels. OCRTurk consists of 180 Turkish documents drawn from academic articles, theses, slide decks, and non-academic articles. We evaluate seven OCR models on OCRTurk using element-wise metrics. Across difficulty levels, PaddleOCR achieves the strongest overall results, leading most element-wise metrics except figures and attaining high Normalized Edit Distance scores in easy, medium, and hard subsets. We also observe performance variation by document type. Models perform well on non-academic documents, while slideshows become the most challenging.

Tags

OCR, document parsing, Turkish, benchmark, evaluation

arXiv Categories

cs.CL cs.AI