🤖 AI Summary
This work addresses the inefficiency of autoregressive OCR decoding in real-world document understanding tasks, where balancing performance and deployment cost remains challenging. The authors propose a compact 0.9B-parameter multimodal model integrating a 0.4B CogViT visual encoder and a 0.5B GLM language decoder, deployed within a two-stage pipeline: PP-DocLayout-V3 first performs layout analysis, followed by parallel content recognition across segmented regions. The key innovation is a multi-token prediction (MTP) mechanism that generates multiple tokens in a single decoding step, substantially improving throughput while effectively constraining memory overhead. Evaluated on both public benchmarks and industrial scenarios, the method achieves state-of-the-art or competitive performance across diverse tasks—including document parsing, text and formula transcription, table structure recovery, and key information extraction—demonstrating suitability for both edge deployment and large-scale production environments.
📝 Abstract
GLM-OCR is a compact 0.9B-parameter multimodal model designed for efficient real-world document understanding. It combines a 0.4B-parameter CogViT visual encoder with a 0.5B-parameter GLM language decoder, striking a strong balance between computational cost and recognition performance. To address the inefficiency of standard autoregressive decoding in deterministic OCR tasks, GLM-OCR introduces a Multi-Token Prediction (MTP) mechanism that predicts multiple tokens per step, significantly improving decoding throughput while keeping memory overhead low through shared parameters. At the system level, a two-stage pipeline is adopted: PP-DocLayout-V3 first performs layout analysis, followed by parallel region-level recognition. Extensive evaluations on public benchmarks and industrial scenarios show that GLM-OCR achieves competitive or state-of-the-art performance in document parsing, text and formula transcription, table structure recovery, and key information extraction. Its compact architecture and structured generation make it suitable for both resource-constrained edge deployment and large-scale production systems.
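To make the MTP idea concrete, here is a minimal, hypothetical sketch of multi-token greedy decoding: one backbone forward pass produces a hidden state, and `k` lightweight heads each emit one token from it, so each step advances the sequence by `k` tokens instead of one. All names (`mtp_decode`, `backbone`, `heads`) are illustrative stand-ins, not GLM-OCR's actual API, and the toy components below exist only to make the loop runnable.

```python
from typing import Callable, List

def mtp_decode(
    backbone: Callable[[List[int]], List[float]],  # one forward pass -> hidden state
    heads: List[Callable[[List[float]], int]],     # k shared-backbone prediction heads
    prompt: List[int],
    max_len: int,
    eos: int,
) -> List[int]:
    """Greedy MTP decoding: each backbone pass yields up to k tokens."""
    tokens = list(prompt)
    while len(tokens) < max_len:
        hidden = backbone(tokens)      # single (expensive) decoder pass
        for head in heads:             # each head predicts one further token
            tok = head(hidden)
            tokens.append(tok)
            if tok == eos or len(tokens) >= max_len:
                return tokens
    return tokens

# Toy components for demonstration only: the "hidden state" is one number,
# and head i deterministically emits (hidden + i) mod 10.
backbone = lambda toks: [sum(toks) % 10]
heads = [lambda h, i=i: (h[0] + i) % 10 for i in range(2)]
out = mtp_decode(backbone, heads, prompt=[1, 2], max_len=8, eos=-1)
```

With two heads, the loop halves the number of backbone passes relative to token-by-token decoding, which is where the throughput gain in deterministic OCR output comes from; the real model's heads share decoder parameters to keep the memory footprint small.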