Multimodal Learning · Relevance: 9/10

OCR or Not? Rethinking Document Information Extraction in the MLLMs Era with Real-World Large-Scale Datasets

Jiyuan Shen, Peiyue Yuan, Atin Ghosh, Yifan Mai, Daniel Dahlmeier
arXiv: 2603.02789v1 · Published: 2026-03-03 · Updated: 2026-03-03

AI Summary

Investigates whether MLLMs need OCR for document information extraction, finding that powerful MLLMs with image-only input can match OCR+MLLM pipelines.

Key Contributions

  • Benchmarked out-of-the-box MLLMs on document information extraction
  • Proposed an automated hierarchical error analysis framework
  • Found that image-only input achieves performance comparable to OCR-enhanced approaches

Methodology

A large-scale benchmark evaluating existing MLLMs on business-document information extraction, together with an automated, LLM-driven hierarchical error analysis framework.
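To make the idea of a hierarchical error analysis concrete, here is a minimal sketch of the first taxonomy level for field-level extraction errors. The paper's framework uses LLMs to diagnose error patterns; this stand-in uses simple rules, and all field names and category labels are hypothetical, not taken from the paper.

```python
# Illustrative sketch of a top-level error taxonomy for document IE.
# Categories here (missing / hallucinated / formatting / wrong value)
# are hypothetical examples, not the paper's actual taxonomy.

def classify_field_error(gold, pred):
    """Assign one top-level error category to a single field."""
    if gold == pred:
        return "correct"
    if pred is None or pred == "":
        return "missing_prediction"      # field not extracted at all
    if gold is None:
        return "hallucinated_field"      # extracted a field absent in gold
    # Normalize whitespace/case to separate formatting slips from wrong values.
    if gold.strip().lower() == pred.strip().lower():
        return "formatting_error"
    return "wrong_value"

def error_report(gold, pred):
    """Aggregate per-field categories over the union of gold and predicted keys."""
    return {key: classify_field_error(gold.get(key), pred.get(key))
            for key in gold.keys() | pred.keys()}

gold = {"invoice_no": "INV-0042", "total": "128.50", "currency": "USD"}
pred = {"invoice_no": "inv-0042", "total": "128.50", "vendor": "Acme"}
report = error_report(gold, pred)
# invoice_no → formatting_error, total → correct,
# currency → missing_prediction, vendor → hallucinated_field
```

In the paper's setting, an LLM judge would replace the rule-based classifier and further subdivide each category into finer-grained error patterns.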

Original Abstract

Multimodal Large Language Models (MLLMs) enhance the potential of natural language processing. However, their actual impact on document information extraction remains unclear. In particular, it is unclear whether an MLLM-only pipeline, while simpler, can truly match the performance of traditional OCR+MLLM setups. In this paper, we conduct a large-scale benchmarking study that evaluates various out-of-the-box MLLMs on business-document information extraction. To examine failure modes, we propose an automated hierarchical error analysis framework that leverages large language models (LLMs) to diagnose error patterns systematically. Our findings suggest that OCR may not be necessary for powerful MLLMs, as image-only input can achieve comparable performance to OCR-enhanced approaches. Moreover, we demonstrate that carefully designed schemas, exemplars, and instructions can further enhance MLLM performance. We hope this work can offer practical guidance and valuable insights for advancing document information extraction.
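The abstract's point that schemas, exemplars, and instructions each matter can be sketched as a prompt-construction helper for image-only extraction. Everything below is a hedged illustration: the schema fields, exemplar values, and instruction wording are assumptions, not the paper's prompts, and the resulting string would be sent alongside the page image to an MLLM API of your choice.

```python
import json

# Hypothetical target schema: field name -> description of expected value.
SCHEMA = {
    "invoice_no": "string, the invoice identifier",
    "issue_date": "string, ISO 8601 date",
    "total_amount": "string, numeric total including decimals",
}

# One in-context exemplar showing the desired output shape (made-up values).
EXEMPLAR = {
    "invoice_no": "INV-0042",
    "issue_date": "2024-05-01",
    "total_amount": "128.50",
}

def build_prompt(schema, exemplar):
    """Combine schema, exemplar, and explicit instructions into one prompt."""
    return (
        "Extract the following fields from the attached document image.\n"
        "Return ONLY a JSON object with exactly these keys:\n"
        f"{json.dumps(schema, indent=2)}\n\n"
        "Example output for a different document:\n"
        f"{json.dumps(exemplar, indent=2)}\n\n"
        "If a field is absent in the document, use null. Do not add extra keys."
    )

prompt = build_prompt(SCHEMA, EXEMPLAR)
```

Keeping the schema, exemplar, and instructions as separate inputs makes it easy to ablate each component, which is how one would test the abstract's claim that all three contribute to performance.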

Tags

MLLM · Document Information Extraction · OCR · Error Analysis

arXiv Categories

cs.CL cs.AI