DocPolarBERT: A Pre-trained Model for Document Understanding with Relative Polar Coordinate Encoding of Layout Structures

📅 2025-07-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Document understanding models that rely on absolute 2D positional embeddings suffer from poor generalization, high computational overhead, and dependence on massive pre-training data. To address these limitations, this paper proposes DocPolarBERT, a novel architecture that replaces conventional Cartesian-coordinate absolute position embeddings with relative polar-coordinate positional encodings, and redesigns the self-attention mechanism to explicitly model directional and distance relationships among text blocks. By eliminating reliance on a global coordinate system, DocPolarBERT significantly enhances layout awareness. Built on a BERT backbone, the model achieves state-of-the-art performance on standard document understanding benchmarks, including FUNSD, CORD, and SROIE, despite being pre-trained on a corpus more than six times smaller than IIT-CDIP. This demonstrates that efficient, layout-aware representation learning is feasible with substantially reduced data requirements.
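To make the relative polar encoding concrete, here is a minimal sketch (not the paper's code) of how pairwise distance and direction between text blocks could be computed from their bounding boxes; the function name polar_relations and the (x0, y0, x1, y1) box format are illustrative assumptions.

import math

def polar_relations(boxes):
    # Pairwise relative polar coordinates (distance, angle) between text-block
    # centers, given axis-aligned bounding boxes as (x0, y0, x1, y1) tuples.
    centers = [((x0 + x1) / 2.0, (y0 + y1) / 2.0) for x0, y0, x1, y1 in boxes]
    relations = []
    for cx_i, cy_i in centers:
        row = []
        for cx_j, cy_j in centers:
            dx, dy = cx_j - cx_i, cy_j - cy_i
            r = math.hypot(dx, dy)      # radial distance: how far block j is from block i
            theta = math.atan2(dy, dx)  # polar angle: in which direction block j lies
            row.append((r, theta))
        relations.append(row)
    return relations

# Example: a header block, a block to its right, and a block below it
blocks = [(10, 10, 60, 30), (80, 10, 140, 30), (10, 50, 60, 70)]
print(polar_relations(blocks)[0][1])  # (distance, angle) from block 0 to block 1

Because every pair is described only by its offset, no global page origin is needed, which is the sense in which the encoding is relative rather than absolute.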

📝 Abstract
We introduce DocPolarBERT, a layout-aware BERT model for document understanding that eliminates the need for absolute 2D positional embeddings. We extend self-attention to take into account text block positions in a relative polar coordinate system rather than the Cartesian one. Despite being pre-trained on a dataset more than six times smaller than the widely used IIT-CDIP corpus, DocPolarBERT achieves state-of-the-art results. These results demonstrate that a carefully designed attention mechanism can compensate for reduced pre-training data, offering an efficient and effective alternative for document understanding.
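The abstract does not detail how the polar relations enter self-attention. One common way to realize relative positional signals is a learned bias added to the attention logits; the PyTorch sketch below bucketizes pairwise distance and angle between block centers and looks up a per-head bias. The class name PolarRelativeBias, the bucket counts, and the additive-bias formulation are assumptions for illustration, not the paper's actual mechanism. In use, the returned (heads, seq, seq) tensor would simply be added to the scaled dot-product scores before the softmax.

import math
import torch
import torch.nn as nn

class PolarRelativeBias(nn.Module):
    # Hypothetical relative-position bias: bucketize pairwise distances and
    # angles between text-block centers and return a learned per-head bias
    # to be added to the attention logits (the QK^T scores).
    def __init__(self, num_heads, dist_buckets=32, angle_buckets=8, max_dist=1000.0):
        super().__init__()
        self.dist_buckets = dist_buckets
        self.angle_buckets = angle_buckets
        self.max_dist = max_dist
        # one learned scalar per (distance bucket, angle bucket) pair and head
        self.bias = nn.Embedding(dist_buckets * angle_buckets, num_heads)

    def forward(self, centers):
        # centers: (seq_len, 2) float tensor of block-center coordinates
        delta = centers[None, :, :] - centers[:, None, :]        # (seq, seq, 2) pairwise offsets
        dist = delta.norm(dim=-1)                                 # radial distance
        angle = torch.atan2(delta[..., 1], delta[..., 0])         # direction in [-pi, pi]
        d_idx = (dist / self.max_dist * self.dist_buckets).long().clamp(0, self.dist_buckets - 1)
        a_idx = ((angle + math.pi) / (2 * math.pi) * self.angle_buckets).long().clamp(0, self.angle_buckets - 1)
        idx = d_idx * self.angle_buckets + a_idx                  # combined bucket id per block pair
        return self.bias(idx).permute(2, 0, 1)                    # (heads, seq, seq) bias for the attention scores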
Problem

Research questions and friction points this paper is trying to address.

Eliminates the need for absolute 2D positional embeddings
Uses relative polar coordinates for text block positions
Compensates for reduced pre-training data with a better attention mechanism
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses relative polar coordinate encoding
Extends self-attention for layout awareness
Achieves state-of-the-art results with a smaller pre-training corpus
Benno Uthayasooriyar
Data Analytics Solutions, SCOR
Antoine Ly
Data Analytics Solutions, SCOR
Franck Vermet
Univ Brest, CNRS, UMR 6205, LMBA
Caio Corro
INSA Rennes, IRISA
Natural Language Processing · Structured Prediction