GLM-OCR Technical Report
AI Summary
GLM-OCR introduces an efficient 0.9B-parameter multimodal model for document understanding that balances strong recognition performance with low computational cost.
Key Contributions
- Introduces a Multi-Token Prediction (MTP) mechanism to accelerate decoding
- Adopts PP-DocLayout-V3 for layout analysis
- Achieves leading performance on multiple document-understanding tasks
Methodology
GLM-OCR pairs a CogViT visual encoder with a GLM language decoder and uses a two-stage pipeline: layout analysis followed by parallel region-level recognition.
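The two-stage flow above can be sketched as follows. This is a minimal illustration, not the actual GLM-OCR API: `detect_layout` and `recognize_region` are hypothetical stand-ins for PP-DocLayout-V3 layout detection and the region-level recognizer, and the parallelism is shown with a thread pool.

```python
from concurrent.futures import ThreadPoolExecutor

def detect_layout(page):
    """Stage 1 (stub): return region boxes with coarse types.
    A real system would run PP-DocLayout-V3 on the page image."""
    return [
        {"type": "text", "bbox": (0, 0, 100, 20)},
        {"type": "table", "bbox": (0, 30, 100, 80)},
        {"type": "formula", "bbox": (0, 90, 100, 110)},
    ]

def recognize_region(region):
    """Stage 2 (stub): run the compact VLM on one cropped region."""
    return {"type": region["type"], "content": f"<{region['type']} content>"}

def parse_document(page, max_workers=4):
    regions = detect_layout(page)  # stage 1: layout analysis
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # stage 2: region-level recognition runs in parallel
        return list(pool.map(recognize_region, regions))

if __name__ == "__main__":
    for r in parse_document(page=None):
        print(r["type"], "->", r["content"])
```

Because regions are independent after layout analysis, stage 2 parallelizes cleanly, which is what makes the pipeline attractive for batch production workloads.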
Original Abstract
GLM-OCR is an efficient 0.9B-parameter compact multimodal model designed for real-world document understanding. It combines a 0.4B-parameter CogViT visual encoder with a 0.5B-parameter GLM language decoder, achieving a strong balance between computational efficiency and recognition performance. To address the inefficiency of standard autoregressive decoding in deterministic OCR tasks, GLM-OCR introduces a Multi-Token Prediction (MTP) mechanism that predicts multiple tokens per step, significantly improving decoding throughput while keeping memory overhead low through shared parameters. At the system level, a two-stage pipeline is adopted: PP-DocLayout-V3 first performs layout analysis, followed by parallel region-level recognition. Extensive evaluations on public benchmarks and industrial scenarios show that GLM-OCR achieves competitive or state-of-the-art performance in document parsing, text and formula transcription, table structure recovery, and key information extraction. Its compact architecture and structured generation make it suitable for both resource-constrained edge deployment and large-scale production systems.
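The decoding-throughput gain from Multi-Token Prediction can be illustrated with a toy loop. This is a sketch of the general MTP idea under assumptions, not the paper's implementation: `K` is a placeholder for the number of tokens predicted per step, and `forward_mtp` is a stub standing in for one decoder forward pass whose shared-parameter heads each emit one of the next `K` tokens.

```python
K = 4  # tokens predicted per step (illustrative; the paper's value may differ)

def forward_mtp(prefix):
    """Stub for one forward pass: the decoder trunk runs once, and K
    lightweight shared-parameter heads predict the next K tokens."""
    start = len(prefix)
    return [f"tok{start + i}" for i in range(K)]

def decode(prompt_len, target_len):
    """Decode until target_len tokens, counting forward passes."""
    seq = [f"tok{i}" for i in range(prompt_len)]
    steps = 0
    while len(seq) < target_len:
        seq.extend(forward_mtp(seq))  # one pass advances the sequence by K
        steps += 1
    return seq[:target_len], steps

seq, steps = decode(prompt_len=2, target_len=10)
print(steps)  # 2 forward passes for 8 new tokens, vs 8 with one-token-per-step decoding
```

The loop structure shows why MTP suits deterministic OCR output: when the next several tokens are highly predictable (as in transcription), each forward pass can commit `K` tokens, cutting the number of decoder passes roughly by a factor of `K` while the shared heads keep the memory overhead small.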