Multimodal Learning · Relevance: 9/10

GLM-OCR Technical Report

Shuaiqi Duan, Yadong Xue, Weihan Wang, Zhe Su, Huan Liu, Sheng Yang, Guobing Gan, Guo Wang, Zihan Wang, Shengdong Yan, Dexin Jin, Yuxuan Zhang, Guohong Wen, Yanfeng Wang, Yutao Zhang, Xiaohan Zhang, Wenyi Hong, Yukuo Cen, Da Yin, Bin Chen, Wenmeng Yu, Xiaotao Gu, Jie Tang
arXiv: 2603.10910v1 · Published: 2026-03-11 · Updated: 2026-03-11

AI Summary

GLM-OCR proposes an efficient 0.9B-parameter multimodal model for document understanding, offering both high performance and high efficiency.

Key Contributions

  • Introduces a Multi-Token Prediction mechanism to accelerate decoding
  • Adopts PP-DocLayout-V3 for layout analysis
  • Achieves leading performance on multiple document-understanding tasks

Methodology

Combines a CogViT visual encoder with a GLM language decoder in a two-stage pipeline: layout analysis followed by parallel region-level recognition.
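The two-stage pipeline can be sketched as follows. This is a minimal illustrative skeleton, not the paper's implementation: `detect_regions` and `recognize_region` are hypothetical stand-ins for the PP-DocLayout-V3 layout model and the region-level recognition model, and the parallelism here is a simple thread pool.

```python
from concurrent.futures import ThreadPoolExecutor

def detect_regions(page):
    """Stand-in for a layout-analysis model such as PP-DocLayout-V3:
    returns (region_type, bounding_box) pairs for one page image."""
    # Hypothetical fixed output for illustration only.
    return [("text", (0, 0, 100, 40)), ("table", (0, 50, 100, 90))]

def recognize_region(region):
    """Stand-in for the region-level recognition model."""
    kind, box = region
    return f"<{kind}> recognized content in {box}"

def parse_document(page, max_workers=4):
    # Stage 1: layout analysis over the whole page.
    regions = detect_regions(page)
    # Stage 2: recognize all detected regions in parallel.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(recognize_region, regions))
```

Because region recognition is independent per region, stage 2 parallelizes trivially, which is what makes the two-stage design attractive for throughput.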

Original Abstract

GLM-OCR is an efficient 0.9B-parameter compact multimodal model designed for real-world document understanding. It combines a 0.4B-parameter CogViT visual encoder with a 0.5B-parameter GLM language decoder, achieving a strong balance between computational efficiency and recognition performance. To address the inefficiency of standard autoregressive decoding in deterministic OCR tasks, GLM-OCR introduces a Multi-Token Prediction (MTP) mechanism that predicts multiple tokens per step, significantly improving decoding throughput while keeping memory overhead low through shared parameters. At the system level, a two-stage pipeline is adopted: PP-DocLayout-V3 first performs layout analysis, followed by parallel region-level recognition. Extensive evaluations on public benchmarks and industrial scenarios show that GLM-OCR achieves competitive or state-of-the-art performance in document parsing, text and formula transcription, table structure recovery, and key information extraction. Its compact architecture and structured generation make it suitable for both resource-constrained edge deployment and large-scale production systems.
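The Multi-Token Prediction idea from the abstract can be sketched as a toy decoder: a shared trunk produces one hidden state per step, and k lightweight heads each predict one of the next k tokens, so a single forward pass emits k tokens. All names and shapes here are illustrative assumptions, not GLM-OCR's actual architecture.

```python
import numpy as np

class ToyMTPDecoder:
    """Toy multi-token-prediction (MTP) greedy decoder.

    The embedding and trunk are shared across all k heads, so the extra
    memory cost of predicting k tokens per step stays small.
    """

    def __init__(self, vocab_size, hidden, k, seed=0):
        rng = np.random.default_rng(seed)
        self.k = k
        self.embed = rng.normal(size=(vocab_size, hidden))
        self.trunk = rng.normal(size=(hidden, hidden))
        # k small output heads share the trunk above.
        self.heads = rng.normal(size=(k, hidden, vocab_size))

    def step(self, token_id):
        # One shared forward pass, then each head greedily predicts
        # token t+1, t+2, ..., t+k from the same hidden state.
        h = np.tanh(self.embed[token_id] @ self.trunk)
        return [int(np.argmax(h @ self.heads[i])) for i in range(self.k)]

    def decode(self, start_token, n_tokens):
        out, cur = [], start_token
        while len(out) < n_tokens:
            preds = self.step(cur)
            out.extend(preds)
            cur = preds[-1]
        return out[:n_tokens]
```

With k = 4, emitting 12 tokens takes 3 forward passes instead of 12, which is the source of the decoding-throughput gain for deterministic tasks like OCR.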

Tags

OCR · Document Understanding · Multimodal Learning · GLM · Visual Encoder

arXiv Category

cs.CL