Multimodal Learning Relevance: 8/10

PP-OCRv5: A Specialized 5M-Parameter Model Rivaling Billion-Parameter Vision-Language Models on OCR Tasks

Cheng Cui, Yubo Zhang, Ting Sun, Xueqing Wang, Hongen Liu, Manhui Lin, Yue Zhang, Tingquan Gao, Changda Zhou, Jiaxuan Liu, Zelun Zhang, Jing Zhang, Jun Zhang, Yi Liu
arXiv: 2603.24373v1 Published: 2026-03-25 Updated: 2026-03-25

AI Summary

With only 5M parameters, PP-OCRv5 rivals VLMs with billions of parameters, underscoring the importance of high-quality data in OCR.

Key Contributions

  • Proposes PP-OCRv5, a lightweight OCR system
  • Systematically studies the impact of data quality on OCR performance
  • Demonstrates that high-quality data can raise the performance ceiling of traditional OCR

Methodology

Through data analysis, the training data is curated to improve its quality along three dimensions (difficulty, accuracy, diversity), thereby boosting OCR performance.
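The three data-quality dimensions can be illustrated with a toy sketch. The concrete metrics below (label-vs-reference accuracy, character-set size as a diversity proxy, label length as a difficulty proxy) and the sample data are hypothetical stand-ins for illustration, not the paper's actual measurements:

```python
from collections import Counter

# Hypothetical OCR training samples: (image_id, label, verified_reference).
# The three functions below are illustrative proxies, not the paper's metrics.
samples = [
    ("img_001", "hello", "hello"),
    ("img_002", "w0rld", "world"),   # mislabeled sample
    ("img_003", "データ", "データ"),
    ("img_004", "OCR", "OCR"),
]

def label_accuracy(samples):
    """Fraction of labels that match a verified reference transcription."""
    correct = sum(label == ref for _, label, ref in samples)
    return correct / len(samples)

def char_diversity(samples):
    """Number of distinct characters across reference texts
    (a crude proxy for data diversity)."""
    chars = Counter(ch for _, _, ref in samples for ch in ref)
    return len(chars)

def mean_length(samples):
    """Average reference length, used here as a stand-in for difficulty."""
    return sum(len(ref) for _, _, ref in samples) / len(samples)

print(label_accuracy(samples))  # → 0.75
print(char_diversity(samples))  # → 13
print(mean_length(samples))     # → 4.0
```

In a data-centric pipeline, such scores would guide which samples to relabel, drop, or augment before retraining.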

Original Abstract

The advent of "OCR 2.0" and large-scale vision-language models (VLMs) has set new benchmarks in text recognition. However, these unified architectures often come with significant computational demands, challenges in precise text localization within complex layouts, and a propensity for textual hallucinations. Revisiting the prevailing notion that model scale is the sole path to high accuracy, this paper introduces PP-OCRv5, a meticulously optimized, lightweight OCR system with merely 5 million parameters. We demonstrate that PP-OCRv5 achieves performance competitive with many billion-parameter VLMs on standard OCR benchmarks, while offering superior localization precision and reduced hallucinations. The cornerstone of our success lies not in architectural expansion but in a data-centric investigation. We systematically dissect the role of training data by quantifying three critical dimensions: data difficulty, data accuracy, and data diversity. Our extensive experiments reveal that with a sufficient volume of high-quality, accurately labeled, and diverse data, the performance ceiling for traditional, efficient two-stage OCR pipelines is far higher than commonly assumed. This work provides compelling evidence for the viability of lightweight, specialized models in the large-model era and offers practical insights into data curation for OCR. The source code and models are publicly available at https://github.com/PaddlePaddle/PaddleOCR.

Tags

OCR · Lightweight Models · Data Quality · Vision-Language Models

arXiv Category

cs.CV