🤖 AI Summary
Modern document layouts exhibit highly variable element counts and increasingly complex structures, causing performance degradation in conventional layout analysis methods. To address this, we propose a generative document layout analysis framework that synergistically integrates diffusion modeling with autoregressive decoding. To our knowledge, this is the first approach to jointly model iterative bounding-box refinement (via diffusion) and semantic-context-aware sequence generation (via autoregressive decoding) within a single, unified architecture. We further introduce a multi-scale feature-fusion encoder to jointly capture fine-grained localization cues and high-level semantic structural information. Evaluated on the DocLayNet and M⁶Doc benchmarks, our method achieves a new state-of-the-art mAP of 83.5%, substantially outperforming existing approaches. These results demonstrate the effectiveness and generalizability of the generative paradigm for complex document understanding.
📝 Abstract
Conventional document layout analysis (DLA) relies on empirical priors or a fixed set of learnable queries executed in a single forward pass. While sufficient for early-generation documents with a small, predetermined number of regions, this paradigm struggles with contemporary documents, which exhibit diverse element counts and increasingly complex layouts. To address these challenges, we present HybriDLA, a novel generative framework that unifies diffusion and autoregressive decoding within a single layer. The diffusion component iteratively refines bounding-box hypotheses, whereas the autoregressive component injects semantic and contextual awareness, enabling precise region prediction even in highly varied layouts. To further enhance detection quality, we design a multi-scale feature-fusion encoder that captures both fine-grained and high-level visual cues, elevating performance to 83.5% mean Average Precision (mAP). Extensive experiments on the DocLayNet and M⁶Doc benchmarks demonstrate that HybriDLA achieves state-of-the-art performance, outperforming previous approaches. All data and models will be made publicly available at https://yufanchen96.github.io/projects/HybriDLA.
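To make the hybrid decoding idea concrete, here is a minimal toy sketch (not the authors' implementation): an outer autoregressive loop emits one region at a time, while an inner diffusion-style loop iteratively refines that region's bounding box from noise. The `denoise_step` "predictor" is a hypothetical stand-in that interpolates toward a known target box; in the actual model this role is played by a learned network conditioned on image features and previously decoded regions.

```python
import numpy as np

def denoise_step(noisy_box, predicted_clean_box, t, num_steps):
    """One toy reverse-diffusion update: move the noisy box toward the
    predicted clean box, with weight growing as refinement proceeds."""
    alpha = (t + 1) / num_steps  # confidence increases over steps
    return (1 - alpha) * noisy_box + alpha * predicted_clean_box

def decode_layout(targets, num_diffusion_steps=8, seed=0):
    """Autoregressively decode one region per target; each region's box
    is produced by iterative diffusion refinement starting from noise.
    `targets` stands in for what a trained predictor would output."""
    rng = np.random.default_rng(seed)
    decoded = []  # context of already-emitted regions (grows autoregressively)
    for clean_box, label in targets:
        box = rng.uniform(0.0, 1.0, size=4)  # start from a random noisy box
        for t in range(num_diffusion_steps):
            # a real model would predict the clean box from (box, image, decoded)
            box = denoise_step(box, np.asarray(clean_box), t, num_diffusion_steps)
        decoded.append((box, label))  # label decoding is the AR component
    return decoded

# Hypothetical two-region layout in normalized [x1, y1, x2, y2] coordinates.
regions = decode_layout([([0.1, 0.1, 0.5, 0.3], "title"),
                         ([0.1, 0.4, 0.9, 0.9], "paragraph")])
```

In this sketch the final diffusion step recovers the target box exactly; the point is only to illustrate how per-box iterative refinement nests inside sequential region generation.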