Up to 36x Speedup: Mask-based Parallel Inference Paradigm for Key Information Extraction in MLLMs

📅 2026-01-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the inefficiency of multimodal large language models in key information extraction tasks, where autoregressive inference hinders parallel processing of semantically independent fields. To overcome this limitation, the authors propose PIP (Parallel Inference Paradigm), a novel framework that introduces a [mask]-based generation mechanism to simultaneously predict all target fields in a single forward pass. By integrating tailored masked pretraining with large-scale supervised data, PIP achieves substantial inference speedups—ranging from 5× to 36×—while preserving near-perfect accuracy. This paradigm significantly enhances deployment efficiency in real-world applications without compromising model performance.

📝 Abstract
Key Information Extraction (KIE) from visually-rich documents (VrDs) is a critical task, for which recent Large Language Models (LLMs) and Multi-Modal Large Language Models (MLLMs) have demonstrated strong potential. However, their reliance on autoregressive inference, which generates outputs sequentially, creates a significant efficiency bottleneck, especially as KIE tasks often involve extracting multiple, semantically independent fields. To overcome this limitation, we introduce PIP: a Parallel Inference Paradigm for KIE. Our approach reformulates the problem by using "[mask]" tokens as placeholders for all target values, enabling their simultaneous generation in a single forward pass. To facilitate this paradigm, we develop a tailored mask pre-training strategy and construct large-scale supervised datasets. Experimental results show that our PIP models achieve a 5-36x inference speedup with negligible performance degradation compared to traditional autoregressive base models. By substantially improving efficiency while maintaining high accuracy, PIP paves the way for scalable and practical real-world KIE solutions.
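The core idea in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the field names, values, and the stub "model" below are invented for illustration. It only contrasts the decoding-step counts of token-by-token autoregressive generation with a mask-based scheme that fills every placeholder in a single pass.

```python
# Illustrative sketch (not PIP's actual code): mask-based parallel field
# extraction vs. autoregressive decoding for a KIE-style prompt.
# The "model" is a stub lookup standing in for an MLLM's predictions.

FIELDS = ["invoice_no", "date", "total"]

# Hypothetical predictions a real MLLM would produce from a document image.
PREDICTIONS = {"invoice_no": "INV-001", "date": "2026-01-27", "total": "42.00"}

def build_masked_prompt(fields):
    """One [mask] placeholder per target value."""
    return "\n".join(f"{f}: [mask]" for f in fields)

def parallel_fill(prompt, model):
    """Simulate one forward pass: all [mask] positions filled simultaneously."""
    lines = []
    for line in prompt.split("\n"):
        field = line.split(":")[0]
        lines.append(line.replace("[mask]", model[field]))
    return "\n".join(lines), 1  # one decoding step in total

def autoregressive_fill(fields, model):
    """Simulate sequential decoding: one forward pass per generated token."""
    steps = 0
    out = {}
    for f in fields:
        tokens = list(model[f])  # pretend each character is one token
        steps += len(tokens)     # each token costs a separate forward pass
        out[f] = "".join(tokens)
    return out, steps

filled, par_steps = parallel_fill(build_masked_prompt(FIELDS), PREDICTIONS)
_, ar_steps = autoregressive_fill(FIELDS, PREDICTIONS)
print(filled)
print(f"autoregressive steps: {ar_steps}, parallel steps: {par_steps}")
```

Under this toy setup the autoregressive path costs one step per output token (22 here), while the mask-based path costs a single step regardless of how many fields are extracted, which is where the reported 5-36x speedup would come from as field counts grow.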
Problem

Research questions and friction points this paper is trying to address.

Key Information Extraction
Multi-Modal Large Language Models
Autoregressive Inference
Inference Efficiency
Visually-Rich Documents
Innovation

Methods, ideas, or system contributions that make the work stand out.

Parallel Inference
Mask-based Generation
Key Information Extraction
Multi-Modal Large Language Models
Efficiency Optimization
Xinzhong Wang
Shanghai Jiao Tong University
Ya Guo
Ant Info Security Lab, Ant Group
Jing Li
Ant Info Security Lab, Ant Group
Huan Chen
Shunfeng Technology Company Limited
Artificial Intelligence · Formal Methods
Yi Tu
Ant Group
Computer Vision · Document Understanding · Vision Language Model
Yijie Hong
Shanghai Jiao Tong University
Gongshen Liu
Shanghai Jiao Tong University; Inner Mongolia Research Institute, Shanghai Jiao Tong University, Hohhot 010010
Huijia Zhu
Ant Info Security Lab, Ant Group