🤖 AI Summary
This work addresses the inefficiency of autoregressive OCR decoding in real-world document understanding tasks, where balancing performance and deployment cost remains challenging. The authors propose a compact 0.9B-parameter multimodal model integrating a 0.4B CogViT visual encoder and a 0.5B GLM language decoder, deployed within a two-stage pipeline: PP-DocLayout-V3 first performs layout analysis, followed by parallel content recognition across segmented regions. The key innovation is a multi-token prediction (MTP) mechanism that generates multiple tokens in a single decoding step, substantially improving throughput while effectively constraining memory overhead. Evaluated on both public benchmarks and industrial scenarios, the method achieves state-of-the-art or competitive performance across diverse tasks—including document parsing, text and formula transcription, table structure recovery, and key information extraction—demonstrating suitability for both edge deployment and large-scale production environments.
📝 Abstract
GLM-OCR is a compact 0.9B-parameter multimodal model designed for efficient real-world document understanding. It combines a 0.4B-parameter CogViT visual encoder with a 0.5B-parameter GLM language decoder, striking a strong balance between computational cost and recognition performance. To address the inefficiency of standard autoregressive decoding in deterministic OCR tasks, GLM-OCR introduces a Multi-Token Prediction (MTP) mechanism that predicts multiple tokens per step, significantly improving decoding throughput while keeping memory overhead low through shared parameters. At the system level, a two-stage pipeline is adopted: PP-DocLayout-V3 first performs layout analysis, followed by parallel region-level recognition. Extensive evaluations on public benchmarks and industrial scenarios show that GLM-OCR achieves competitive or state-of-the-art performance in document parsing, text and formula transcription, table structure recovery, and key information extraction. Its compact architecture and structured generation make it suitable for both resource-constrained edge deployment and large-scale production systems.
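To make the MTP idea concrete, here is a minimal, hypothetical sketch of multi-token greedy decoding: one backbone forward pass produces a hidden state, and `k` lightweight heads each emit one token from it, so each step advances the sequence by `k` tokens instead of one. All names (`mtp_decode`, `backbone`, `heads`) are illustrative stand-ins, not GLM-OCR's actual API, and the toy components below exist only to make the loop runnable.

```python
from typing import Callable, List

def mtp_decode(
    backbone: Callable[[List[int]], List[float]],  # one forward pass -> hidden state
    heads: List[Callable[[List[float]], int]],     # k shared-backbone prediction heads
    prompt: List[int],
    max_len: int,
    eos: int,
) -> List[int]:
    """Greedy MTP decoding: each backbone pass yields up to k tokens."""
    tokens = list(prompt)
    while len(tokens) < max_len:
        hidden = backbone(tokens)      # single (expensive) decoder pass
        for head in heads:             # each head predicts one further token
            tok = head(hidden)
            tokens.append(tok)
            if tok == eos or len(tokens) >= max_len:
                return tokens
    return tokens

# Toy components for demonstration only: the "hidden state" is one number,
# and head i deterministically emits (hidden + i) mod 10.
backbone = lambda toks: [sum(toks) % 10]
heads = [lambda h, i=i: (h[0] + i) % 10 for i in range(2)]
out = mtp_decode(backbone, heads, prompt=[1, 2], max_len=8, eos=-1)
```

With two heads, the loop halves the number of backbone passes relative to token-by-token decoding, which is where the throughput gain in deterministic OCR output comes from; the real model's heads share decoder parameters to keep the memory footprint small.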