GLM-OCR Technical Report

📅 2026-03-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the inefficiency of autoregressive OCR decoding in real-world document understanding tasks, where balancing performance and deployment cost remains challenging. The authors propose a compact 0.9B-parameter multimodal model integrating a 0.4B CogViT visual encoder and a 0.5B GLM language decoder, deployed within a two-stage pipeline: PP-DocLayout-V3 first performs layout analysis, followed by parallel content recognition across segmented regions. The key innovation is a multi-token prediction (MTP) mechanism that generates multiple tokens in a single decoding step, substantially improving throughput while effectively constraining memory overhead. Evaluated on both public benchmarks and industrial scenarios, the method achieves state-of-the-art or competitive performance across diverse tasks—including document parsing, text and formula transcription, table structure recovery, and key information extraction—demonstrating suitability for both edge deployment and large-scale production environments.
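The two-stage pipeline described above (layout analysis, then parallel region-level recognition) can be sketched as follows. This is a minimal control-flow illustration, not the paper's implementation: `detect_layout` and `recognize_region` are hypothetical stand-ins for PP-DocLayout-V3 and the GLM-OCR recognition model.

```python
from concurrent.futures import ThreadPoolExecutor

def detect_layout(page):
    """Stage 1 (stub): split a page into typed regions.

    A real layout model would return bounding boxes for text blocks,
    tables, formulas, etc.; here we fake two regions for illustration.
    """
    return [{"type": "text", "crop": page[:1]},
            {"type": "table", "crop": page[1:]}]

def recognize_region(region):
    """Stage 2 (stub): run the recognition model on one cropped region."""
    return {"type": region["type"], "content": f"<{region['type']} content>"}

def parse_document(page, max_workers=4):
    """Two-stage pipeline: layout analysis, then parallel recognition."""
    regions = detect_layout(page)
    # Regions are independent after segmentation, so recognition can
    # run concurrently across them.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(recognize_region, regions))

results = parse_document(["text-region-pixels", "table-region-pixels"])
```

Because each segmented region is decoded independently, region-level parallelism composes with the per-region decoding speedup from MTP.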

📝 Abstract
GLM-OCR is a compact and efficient 0.9B-parameter multimodal model designed for real-world document understanding. It combines a 0.4B-parameter CogViT visual encoder with a 0.5B-parameter GLM language decoder, achieving a strong balance between computational efficiency and recognition performance. To address the inefficiency of standard autoregressive decoding in deterministic OCR tasks, GLM-OCR introduces a Multi-Token Prediction (MTP) mechanism that predicts multiple tokens per step, significantly improving decoding throughput while keeping memory overhead low through shared parameters. At the system level, a two-stage pipeline is adopted: PP-DocLayout-V3 first performs layout analysis, followed by parallel region-level recognition. Extensive evaluations on public benchmarks and industrial scenarios show that GLM-OCR achieves competitive or state-of-the-art performance in document parsing, text and formula transcription, table structure recovery, and key information extraction. Its compact architecture and structured generation make it suitable for both resource-constrained edge deployment and large-scale production systems.
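The throughput benefit of multi-token prediction comes from emitting several tokens per forward pass instead of one. The toy decode loop below illustrates the idea only; the actual number of prediction heads in GLM-OCR and its head architecture are not specified here, so `num_heads` and `toy_step` are illustrative assumptions.

```python
def mtp_decode(prompt, predict_step, num_heads, max_new_tokens):
    """Greedy decode loop where each forward pass may emit several tokens.

    predict_step(seq) returns up to num_heads candidate next tokens from
    one forward pass. In an MTP model the extra heads share the backbone
    parameters, which is why the memory overhead stays small.
    """
    seq = list(prompt)
    steps = 0  # number of forward passes (the quantity MTP reduces)
    while len(seq) - len(prompt) < max_new_tokens:
        remaining = max_new_tokens - (len(seq) - len(prompt))
        tokens = predict_step(seq)[: min(num_heads, remaining)]
        seq.extend(tokens)
        steps += 1
    return seq, steps

# Toy "model": always proposes the next four integer token ids.
toy_step = lambda seq: [len(seq) + i for i in range(4)]

# Standard autoregressive decoding: one token per pass -> 16 passes.
_, ar_steps = mtp_decode([0], toy_step, num_heads=1, max_new_tokens=16)
# MTP with 4 heads (an assumed value): 4 tokens per pass -> 4 passes.
_, mtp_steps = mtp_decode([0], toy_step, num_heads=4, max_new_tokens=16)
```

For largely deterministic outputs such as OCR transcriptions, the multi-token proposals are usually accepted, so the reduction in forward passes translates almost directly into decoding throughput.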
Problem

Research questions and friction points this paper is trying to address.

OCR
document understanding
autoregressive decoding
computational efficiency
multimodal model
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-Token Prediction
Compact Multimodal Model
Structured Generation
Two-Stage OCR Pipeline
Efficient Autoregressive Decoding