Multimodal Learning Relevance: 9/10

Real5-OmniDocBench: A Full-Scale Physical Reconstruction Benchmark for Robust Document Parsing in the Wild

Changda Zhou, Ziyue Gao, Xueqing Wang, Tingquan Gao, Cheng Cui, Jing Tang, Yi Liu
arXiv: 2603.04205v1 Published: 2026-03-04 Updated: 2026-03-04

AI Summary

Constructs Real5-OmniDocBench, the first large-scale physical-reconstruction benchmark for document parsing, to evaluate the robustness of VLMs in real-world scenarios.

Key Contributions

  • Constructed the Real5-OmniDocBench benchmark
  • Performed a full-scale physical reconstruction of OmniDocBench
  • Provided an attribution analysis of performance degradation caused by geometric distortions, optical artifacts, and model limitations

Methodology

The entire OmniDocBench is physically reconstructed, and VLM document-parsing performance is evaluated across five real-world scenarios (scanning, warping, and others); the causes of performance degradation are then attributed factor by factor.
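The factor-wise attribution described above can be sketched as a simple comparison of per-image scores between the digital originals and each physical scenario. This is a minimal illustration, not the paper's actual pipeline: the score dictionaries, the function name `attribute_degradation`, and the uniform 0.10 toy drop are all assumptions for demonstration.

```python
# Hypothetical sketch: mean score drop per scenario relative to the
# digital baseline, given per-image parsing scores (higher is better).
# All names and numbers here are illustrative, not from the paper.

SCENARIOS = ["Scanning", "Warping", "Screen-Photography", "Illumination", "Skew"]

def attribute_degradation(digital_scores, scenario_scores):
    """Return the mean absolute score drop for each scenario.

    digital_scores:  {image_id: score} on the original digital pages.
    scenario_scores: {scenario: {image_id: score}} on the physical copies.
    The one-to-one reconstruction guarantees every image_id appears in both.
    """
    report = {}
    for scenario in SCENARIOS:
        drops = [
            digital_scores[img] - scenario_scores[scenario][img]
            for img in digital_scores
        ]
        report[scenario] = sum(drops) / len(drops)
    return report

# Toy usage: three images, each scenario lowering every score by 0.10.
digital = {"doc1": 0.98, "doc2": 0.95, "doc3": 0.99}
physical = {s: {img: v - 0.10 for img, v in digital.items()} for s in SCENARIOS}
print(attribute_degradation(digital, physical))
```

Because every physical image maps back to its exact digital counterpart, the drop can be read per factor rather than averaged over an uncontrolled mix of conditions, which is the point of the benchmark's complete ground-truth mapping.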

Original Abstract

While Vision-Language Models (VLMs) achieve near-perfect scores on digital document benchmarks like OmniDocBench, their performance in the unpredictable physical world remains largely unknown due to the lack of controlled yet realistic evaluations. We introduce Real5-OmniDocBench, the first benchmark that performs a full-scale, one-to-one physical reconstruction of the entire OmniDocBench v1.5 (1,355 images) across five critical real-world scenarios: Scanning, Warping, Screen-Photography, Illumination, and Skew. Unlike prior benchmarks that either lack digital correspondence or employ partial sampling, our complete ground-truth mapping enables, for the first time, rigorous factor-wise attribution of performance degradation, allowing us to pinpoint whether failures stem from geometric distortions, optical artifacts, or model limitations. Our benchmark establishes a challenging new standard for the community, demonstrating that the 'reality gap' in document parsing is far from closed, and provides a diagnostic tool to guide the development of truly resilient document intelligence.

Tags

Document Parsing  Vision-Language Models  Benchmarking  Physical Reconstruction  Robustness

arXiv Categories

cs.CV