HybriDLA: Hybrid Generation for Document Layout Analysis

📅 2025-11-24

🤖 AI Summary

Modern document layouts vary widely in element count and are increasingly complex in structure, which degrades the performance of conventional layout analysis methods. To address this, the paper proposes a generative document layout analysis framework that combines diffusion modeling with autoregressive decoding. According to the authors, it is the first approach to jointly model iterative bounding-box refinement (via diffusion) and semantic, context-aware sequence generation (via autoregressive decoding) within a single, unified architecture. The framework also introduces a multi-scale feature-fusion encoder that captures both fine-grained localization cues and high-level semantic and structural information. Evaluated on the DocLayNet and M⁶Doc benchmarks, the method reaches a state-of-the-art mAP of 83.5%, outperforming existing approaches. These results support the effectiveness and generalizability of the generative paradigm for complex document understanding.

📝 Abstract

Conventional document layout analysis (DLA) depends on empirical priors or a fixed set of learnable queries executed in a single forward pass. While sufficient for early-generation documents with a small, predetermined number of regions, this paradigm struggles with contemporary documents, which exhibit diverse element counts and increasingly complex layouts. To address the challenges posed by modern documents, we present HybriDLA, a novel generative framework that unifies diffusion and autoregressive decoding within a single layer. The diffusion component iteratively refines bounding-box hypotheses, whereas the autoregressive component injects semantic and contextual awareness, enabling precise region prediction even in highly varied layouts. To further enhance detection quality, we design a multi-scale feature-fusion encoder that captures both fine-grained and high-level visual cues. This architecture raises performance to 83.5% mean Average Precision (mAP). Extensive experiments on the DocLayNet and M⁶Doc benchmarks demonstrate that HybriDLA sets a new state of the art, outperforming previous approaches. All data and models will be made publicly available at https://yufanchen96.github.io/projects/HybriDLA.
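
To make the interplay described in the abstract concrete, the following is a toy sketch and not the HybriDLA implementation. It assumes boxes in `[x1, y1, x2, y2]` format, a simple linear refinement schedule, and a hypothetical `toy_denoiser` standing in for a learned network; none of these names or choices come from the paper. The outer loop plays the role of diffusion-style refinement from noise, and the inner loop decodes boxes one after another so that each prediction can see the boxes decoded before it.

```python
import random

def refine_step(box, predicted, t, num_steps):
    """Move a noisy box toward the denoiser's estimate.

    The step size grows from 1/num_steps to 1, so the last step
    lands on the estimate (an assumed schedule, for illustration).
    """
    alpha = 1.0 / (num_steps - t)
    return [b + alpha * (p - b) for b, p in zip(box, predicted)]

def hybrid_decode(denoiser, num_boxes, num_steps=4, seed=0):
    """Refine noisy boxes over several steps, decoding them in sequence."""
    rng = random.Random(seed)
    # Diffusion-style start: every box hypothesis is random noise in [0, 1).
    boxes = [[rng.random() for _ in range(4)] for _ in range(num_boxes)]
    for t in range(num_steps):
        context = []  # boxes already decoded in this pass
        for i in range(num_boxes):
            # Autoregressive part: the estimate depends on earlier boxes.
            predicted = denoiser(boxes[i], context, t)
            boxes[i] = refine_step(boxes[i], predicted, t, num_steps)
            context.append(boxes[i])
    return boxes

def toy_denoiser(box, context, t):
    """Hypothetical stand-in for a learned model.

    It stacks regions vertically: each box starts where the
    previously decoded box ends.
    """
    top = context[-1][3] if context else 0.0
    return [0.1, top, 0.9, top + 0.2]

if __name__ == "__main__":
    for box in hybrid_decode(toy_denoiser, num_boxes=3):
        print([round(v, 3) for v in box])
```

In this toy setting the boxes start as noise and end as three vertically stacked regions, because each box's estimate is conditioned on the box decoded just before it. A real model would replace `toy_denoiser` with a network conditioned on image features.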

Problem

Research questions and friction points this paper is trying to address.

Addresses limitations of conventional document layout analysis with modern complex layouts

Unifies diffusion and autoregressive decoding for precise region prediction

Enhances detection quality using multi-scale feature-fusion encoder

Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid diffusion-autoregressive decoding for layout analysis

Multi-scale feature-fusion encoder captures visual cues

Unified framework refines bounding-box hypotheses iteratively
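
The multi-scale fusion idea listed above can be illustrated with a generic top-down fusion step of the kind used in feature-pyramid designs. This is a minimal sketch under stated assumptions, not the encoder from the paper: the feature maps are plain 2D lists, the coarse map is assumed to be exactly half the resolution of the fine map, and the function names `upsample2x` and `fuse` are hypothetical.

```python
def upsample2x(grid):
    """Nearest-neighbour upsampling: each cell becomes a 2x2 block."""
    out = []
    for row in grid:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out

def fuse(fine, coarse):
    """Add the upsampled coarse map to the fine map, cell by cell."""
    up = upsample2x(coarse)
    if len(up) != len(fine) or len(up[0]) != len(fine[0]):
        raise ValueError("fine map must be twice the size of the coarse map")
    return [[f + c for f, c in zip(frow, crow)]
            for frow, crow in zip(fine, up)]

if __name__ == "__main__":
    fine = [[1] * 4 for _ in range(4)]   # high-resolution, local detail
    coarse = [[10, 20], [30, 40]]        # low-resolution, wider context
    for row in fuse(fine, coarse):
        print(row)
```

Each fused cell then carries both its local value and the coarser value covering the same area, which is the general intuition behind combining fine-grained and high-level cues.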