DocMamba: Efficient Document Pre-training with State Space Model

📅 2024-09-18
🏛️ arXiv.org

🤖 AI Summary
To address the quadratic computational complexity of Transformer self-attention, which hinders efficient processing of long, visually rich documents, this work integrates State Space Models (SSMs) into visual document understanding. It proposes the Segment-First Bidirectional Scan (SFBS) mechanism, which enables global contextual modeling and captures contiguous semantic information in linear time while supporting length extrapolation. SFBS jointly encodes document image features and performs sequential modeling, and the resulting model, DocMamba, substantially reduces GPU memory consumption and accelerates inference. The approach achieves state-of-the-art performance on standard benchmarks, including FUNSD, CORD, and SROIE, and generalizes to long sequences on the HRDoc benchmark, supporting its scalability for real-world document understanding tasks.

📝 Abstract
In recent years, visually-rich document understanding has attracted increasing attention. Transformer-based pre-trained models have become the mainstream approach, yielding significant performance gains in this field. However, the quadratic computational complexity of the self-attention mechanism limits their efficiency and their ability to process long documents. In this paper, we present DocMamba, a novel framework based on the state space model. It is designed to reduce computational complexity to linear while preserving global modeling capabilities. To further enhance its effectiveness in document processing, we introduce the Segment-First Bidirectional Scan (SFBS) to capture contiguous semantic information. Experimental results demonstrate that DocMamba achieves new state-of-the-art results on downstream datasets such as FUNSD, CORD, and SROIE, while significantly improving speed and reducing memory usage. Notably, experiments on HRDoc confirm DocMamba's potential for length extrapolation.
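
To make the linear-complexity claim concrete, the sketch below runs a scalar state space recurrence once forward and once backward over a sequence. It is an illustration under simplifying assumptions, not DocMamba's implementation: the parameters `a`, `b`, and `c` are fixed scalars chosen for the example (Mamba-style models use learned, input-dependent parameters and a parallel scan), and summing the two directions is an assumed way to combine them that the abstract does not specify.

```python
from typing import List


def ssm_scan(xs: List[float], a: float = 0.9, b: float = 1.0, c: float = 1.0) -> List[float]:
    """Scalar linear recurrence: h_t = a * h_{t-1} + b * x_t, y_t = c * h_t."""
    h = 0.0
    ys = []
    for x in xs:  # a single pass: O(L) time, constant-size state
        h = a * h + b * x
        ys.append(c * h)
    return ys


def bidirectional_scan(xs: List[float], a: float = 0.9, b: float = 1.0, c: float = 1.0) -> List[float]:
    """Scan left-to-right and right-to-left so each position sees both contexts."""
    forward = ssm_scan(xs, a, b, c)
    backward = ssm_scan(xs[::-1], a, b, c)[::-1]
    # Assumption for illustration: combine the two directions by summation.
    return [f + g for f, g in zip(forward, backward)]


print(ssm_scan([1.0, 0.0, 0.0], a=0.5))            # [1.0, 0.5, 0.25]
print(bidirectional_scan([1.0, 0.0, 0.0], a=0.5))  # [2.0, 0.5, 0.25]
```

Each direction touches every token once, so the cost grows linearly with sequence length, whereas self-attention compares every pair of tokens and grows quadratically.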

Problem

Research questions and friction points this paper is trying to address.

Reducing the quadratic computational complexity of self-attention in document processing
Preserving global modeling capabilities at linear complexity
Improving speed and reducing memory usage on long documents

Innovation

Methods, ideas, or system contributions that make the work stand out.

State space model reduces computational complexity to linear
Segment-First Bidirectional Scan (SFBS) captures contiguous semantic information
DocMamba improves speed and reduces memory usage
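
The idea of putting segments first when flattening a 2D page into a 1D sequence can be sketched as follows. This page does not describe the SFBS ordering rules, so everything here is a hypothetical illustration: the `Token` type, the `segment_id` field, and the rule of ordering segments by their top-most, then left-most token are assumptions, not the authors' procedure.

```python
from typing import Dict, List, NamedTuple, Tuple


class Token(NamedTuple):
    text: str
    segment_id: int  # hypothetical: which text segment the token belongs to
    x: float         # left coordinate
    y: float         # top coordinate


def segment_first_order(tokens: List[Token]) -> List[Token]:
    """Order tokens so that every segment stays contiguous in the sequence."""
    # Anchor each segment at its top-most, then left-most token (assumed rule).
    anchors: Dict[int, Tuple[float, float]] = {}
    for t in tokens:
        key = (t.y, t.x)
        if t.segment_id not in anchors or key < anchors[t.segment_id]:
            anchors[t.segment_id] = key
    # Sort by segment anchor first, then by position inside the segment.
    return sorted(tokens, key=lambda t: (anchors[t.segment_id], t.segment_id, t.y, t.x))


# Two-column example: segment 0 is the left column, segment 1 the right column.
example = [
    Token("Date:", 1, 5.0, 0.0),
    Token("John", 0, 0.0, 1.0),
    Token("Name:", 0, 0.0, 0.0),
    Token("2024", 1, 5.0, 1.0),
]

raster = sorted(example, key=lambda t: (t.y, t.x))
print([t.text for t in raster])                        # ['Name:', 'Date:', 'John', '2024']
print([t.text for t in segment_first_order(example)])  # ['Name:', 'John', 'Date:', '2024']
```

A plain top-to-bottom, left-to-right ordering interleaves the two columns, while the segment-first ordering keeps each segment's tokens adjacent, which is the kind of contiguous semantic information a sequential scan can then pick up.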
Pengfei Hu
NERC-SLIP, University of Science and Technology of China
Zhenrong Zhang
NERC-SLIP, University of Science and Technology of China
Jie Ma
NERC-SLIP, University of Science and Technology of China
Shuhang Liu
NERC-SLIP, University of Science and Technology of China
Jun Du
NERC-SLIP, University of Science and Technology of China
Jianshu Zhang
IFLYTEK Research