🤖 AI Summary
Existing multimodal document large language models lack a structured reasoning mechanism that integrates layout awareness with fine-grained, evidence-grounded alignment, resulting in opaque and unreliable inference. This work proposes DocCogito, a novel framework that explicitly couples layout perception with region-grounded structured reasoning. DocCogito employs a lightweight layout tower to extract global layout priors and replaces free-form chain-of-thought with deterministic Visual-Semantic Chains (VSCs) to supervise intermediate reasoning steps aligned with evidence regions. The framework further enforces layout-reasoning consistency through layout-aware pretraining, VSC-guided cold-start initialization, rejection sampling, GRPO-based reinforcement learning, and a region-confidence reward mechanism. Evaluated on six benchmarks, including DocVQA, WTQ, and ChartQA, DocCogito demonstrates strong generalization and achieves state-of-the-art performance on four of them.
📝 Abstract
Document understanding with multimodal large language models (MLLMs) requires not only accurate answers but also explicit, evidence-grounded reasoning, especially in high-stakes scenarios. However, current document MLLMs still fall short of a complete, human-like reasoning process: even when they improve both layout encoding and chain-of-thought (CoT)-style prompting, the interaction between the two is typically learned implicitly and remains loosely coupled rather than being enforced as a systematic mechanism. We therefore propose DocCogito, a unified framework that integrates global layout perception with structured, region-grounded reasoning. DocCogito introduces a lightweight layout tower that distills page structure into learnable global layout prior tokens, and a deterministic Visual-Semantic Chain (VSC), a concise structured representation less ambiguous than free-form natural-language CoT, to supervise fine-grained intermediate reasoning aligned with evidence regions. Training follows a progressive recipe: layout-perception pretraining, VSC-guided cold start, rejection sampling, and GRPO-based reinforcement learning. To further strengthen the internal coupling between layout priors and VSC execution, we augment standard rewards with a fine-grained region-confidence signal that encourages reasoning traces to stay aligned with their corresponding evidence regions. Extensive experiments on six benchmarks (DocVQA, WTQ, ChartQA, TextVQA, OCRBench, and InfoVQA) demonstrate strong generalization, achieving state-of-the-art results on four of them.
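To make the augmented-reward idea concrete, the sketch below combines a standard answer-correctness reward with a region-confidence term that scores how well each reasoning step's predicted evidence box overlaps its gold box. This is a minimal illustration, not the paper's formulation: the IoU-based confidence measure, the per-step averaging, the `alpha` mixing weight, and all function names are assumptions introduced here for clarity.

```python
# Hedged sketch: mixing an answer reward with a region-confidence reward.
# IoU as the confidence measure and alpha=0.5 are illustrative choices,
# not the paper's actual reward definition.

def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def augmented_reward(answer_correct, pred_regions, gold_regions, alpha=0.5):
    """Blend answer correctness with per-step region alignment.

    answer_correct: whether the final answer matched the reference.
    pred_regions / gold_regions: evidence boxes, one per VSC step,
    assumed already aligned step-by-step.
    """
    answer_reward = 1.0 if answer_correct else 0.0
    if pred_regions and len(pred_regions) == len(gold_regions):
        # Average overlap across reasoning steps: traces whose cited
        # regions drift away from the gold evidence score lower.
        region_reward = sum(
            iou(p, g) for p, g in zip(pred_regions, gold_regions)
        ) / len(gold_regions)
    else:
        region_reward = 0.0
    return (1.0 - alpha) * answer_reward + alpha * region_reward
```

In a GRPO-style setup, a scalar like this would be computed per sampled trace and normalized within each group; the point of the region term is that a correct answer reached via unsupported regions earns less reward than one whose intermediate steps stay grounded in the right evidence.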