Efficient Document Parsing via Parallel Token Prediction
AI Summary
The paper proposes Parallel-Token Prediction (PTP), a method that accelerates document parsing with vision-language models (VLMs) while improving efficiency and generalization.
Main Contributions
- Proposes Parallel-Token Prediction (PTP), a method that accelerates document parsing
- Designs a data generation pipeline that provides large-scale, high-quality training data
- Shows experimentally that PTP significantly improves decoding speed, reduces hallucinations, and strengthens generalization
Methodology
Learnable tokens are inserted into the input sequence, and corresponding training objectives are designed, equipping the VLM with the ability to decode document-parsing output in parallel.
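The decoding-side effect of this idea can be illustrated with a toy sketch: instead of one forward pass per generated token, k placeholder queries are appended so one pass yields k future tokens. The `forward` stand-in and all names below are illustrative assumptions, not the paper's actual model or API.

```python
# Toy sketch of PTP-style decoding vs. autoregressive decoding.
# `forward` is a stand-in for one VLM forward pass: it returns one
# predicted token per query position (here, a deterministic function
# of the current sequence length, purely for illustration).

def forward(context, queries):
    """Predict one token per query in a single 'forward pass'."""
    base = len(context)
    return [base + i for i in range(len(queries))]

def decode_autoregressive(context, n_tokens):
    """Baseline: one forward pass per generated token."""
    out, passes = list(context), 0
    for _ in range(n_tokens):
        tok = forward(out, queries=["<q>"])[0]  # predict a single token
        out.append(tok)
        passes += 1
    return out[len(context):], passes

def decode_ptp(context, n_tokens, k=4):
    """PTP-style: append k learnable placeholder queries so each
    forward pass predicts k future tokens at once."""
    out, passes = list(context), 0
    while len(out) - len(context) < n_tokens:
        step = min(k, n_tokens - (len(out) - len(context)))
        out.extend(forward(out, queries=["<q>"] * step))
        passes += 1
    return out[len(context):], passes
```

With a deterministic toy model both strategies emit the same tokens, but the PTP-style loop needs roughly n/k forward passes instead of n, which is the source of the speedup the paper reports.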
Original Abstract
Document parsing, as a fundamental yet crucial vision task, is being revolutionized by vision-language models (VLMs). However, the autoregressive (AR) decoding inherent to VLMs creates a significant bottleneck, severely limiting parsing speed. In this paper, we propose Parallel-Token Prediction (PTP), a pluggable, model-agnostic, and simple-yet-effective method that enables VLMs to generate multiple future tokens in parallel with improved sample efficiency. Specifically, we insert learnable tokens into the input sequence and design corresponding training objectives to equip the model with parallel decoding capabilities for document parsing. Furthermore, to support effective training, we develop a comprehensive data generation pipeline that efficiently produces large-scale, high-quality document parsing training data for VLMs. Extensive experiments on OmniDocBench and olmOCR-bench demonstrate that our method not only significantly improves decoding speed (1.6x-2.2x) but also reduces model hallucinations and exhibits strong generalization abilities.