Multimodal Learning relevance: 8/10

Designing Production-Scale OCR for India: Multilingual and Domain-Specific Systems

Ali Faraz, Raja Kolla, Ashish Kulkarni, Shubham Agarwal
arXiv: 2602.16430v1 Published: 2026-02-18 Updated: 2026-02-18

AI Summary

The paper designs efficient OCR systems for the Indian context, proposing two training strategies and building two state-of-the-art model series.

Key Contributions

  • Proposes two multilingual OCR training strategies tailored to the Indian context
  • Builds the Chitrapathak series of OCR models, achieving SOTA on Telugu
  • Builds the Parichay series of OCR models for recognizing Indian government documents
  • Provides practical guidance for building production-scale OCR pipelines for India

Methodology

The paper compares two approaches, training a system end-to-end versus fine-tuning an existing OCR model, and additionally trains domain-specific OCR models.

Original Abstract

Designing Optical Character Recognition (OCR) systems for India requires balancing linguistic diversity, document heterogeneity, and deployment constraints. In this paper, we study two training strategies for building multilingual OCR systems with Vision-Language Models through the Chitrapathak series. We first follow a popular multimodal approach, pairing a generic vision encoder with a strong multilingual language model and training the system end-to-end for OCR. Alternatively, we explore fine-tuning an existing OCR model, despite it not being trained for the target languages. Through extensive evaluation on multilingual Indic OCR benchmarks and deployment-oriented metrics, we find that the second strategy consistently achieves better accuracy-latency trade-offs. Chitrapathak-2 achieves a 3-6x speedup over its predecessor while being state-of-the-art (SOTA) in Telugu (6.69 char ANLS) and second best in the rest. In addition, we present Parichay, an independent OCR model series designed specifically for 9 Indian government documents to extract structured key fields, achieving an 89.8% Exact Match score with faster inference. Together, these systems achieve SOTA performance and provide practical guidance for building production-scale OCR pipelines in the Indian context.
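The abstract reports two evaluation metrics: character-level ANLS for the Chitrapathak benchmarks and Exact Match for Parichay's key-field extraction. The paper does not spell out its exact metric definitions, so the following is only a minimal sketch of the conventional formulations: ANLS as normalized Levenshtein similarity with the usual 0.5 threshold (as in DocVQA-style evaluation), and Exact Match as the fraction of fields recovered verbatim. The function names and the threshold default are illustrative assumptions, not the authors' code.

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance between two strings.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def char_anls(pred: str, gold: str, tau: float = 0.5) -> float:
    # Normalized Levenshtein similarity at the character level.
    # Scores below the threshold tau are zeroed, as in standard ANLS.
    if not pred and not gold:
        return 1.0
    nls = 1.0 - levenshtein(pred, gold) / max(len(pred), len(gold))
    return nls if nls >= tau else 0.0

def exact_match(preds: list[str], golds: list[str]) -> float:
    # Fraction of extracted fields that match the ground truth verbatim.
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)
```

A perfect transcription scores `char_anls == 1.0`, while Exact Match is stricter: a single wrong character in a field counts the whole field as a miss, which is why it suits structured key fields (ID numbers, names) on government documents.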

Tags

OCR Multilingual India Vision-Language Models Document Recognition

arXiv Categories

cs.CV cs.AI