DocCogito: Aligning Layout Cognition and Step-Level Grounded Reasoning for Document Understanding

📅 2026-03-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing multimodal document large language models lack a structured reasoning mechanism that integrates layout awareness with fine-grained, evidence-grounded alignment, resulting in opaque and unreliable inference. This work proposes DocCogito, a framework that explicitly couples layout perception with region-grounded structured reasoning. DocCogito employs a lightweight layout tower to extract global layout priors and replaces free-form chain-of-thought with deterministic Visual-Semantic Chains (VSCs) that supervise intermediate reasoning steps aligned with evidence regions. The framework further strengthens layout-reasoning consistency through layout-aware pretraining, VSC-guided cold-start initialization, rejection sampling, GRPO-based reinforcement learning, and a region-confidence reward mechanism. Evaluated on six benchmarks (including DocVQA, WTQ, and ChartQA), DocCogito demonstrates strong generalization and achieves state-of-the-art performance on four of them.

📝 Abstract
Document understanding with multimodal large language models (MLLMs) requires not only accurate answers but also explicit, evidence-grounded reasoning, especially in high-stakes scenarios. However, current document MLLMs still fall short of forming a complete, human-like reasoning process: even when they improve both layout encoding and CoT-style prompting, the interaction between the two is typically learned implicitly and remains loosely coupled rather than being enforced as a systematic mechanism. We therefore propose DocCogito, a unified framework that integrates global layout perception with structured, region-grounded reasoning. DocCogito introduces a lightweight layout tower that distills page structure into learnable global layout prior tokens, and a deterministic Visual-Semantic Chain (VSC), a concise structured representation less ambiguous than free-form natural-language CoT, to supervise fine-grained intermediate reasoning aligned with evidence regions. Training follows a progressive recipe, including layout perception pretraining, VSC-guided cold start, rejection sampling, and GRPO. To further strengthen the internal coupling between layout priors and VSC execution, we augment standard rewards with a fine-grained region-confidence signal that encourages reasoning traces to stay aligned with corresponding evidence regions. Extensive experiments on six benchmarks (DocVQA, WTQ, ChartQA, TextVQA, OCRBench, and InfoVQA) demonstrate strong generalization, achieving state-of-the-art results on four benchmarks.
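The reward augmentation described in the abstract can be sketched in miniature. This is a hypothetical illustration, not the paper's implementation: the IoU-based region-confidence term, the function names, and the mixing weight `region_weight` are all assumptions; the only ideas taken from the abstract are (a) adding a region-alignment signal on top of a standard answer reward and (b) GRPO-style group-relative normalization of rewards into advantages.

```python
# Hypothetical sketch: answer reward + region-confidence term, then
# GRPO-style group-relative advantages. All names and the exact
# formulation are illustrative assumptions, not DocCogito's actual code.

def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def total_reward(answer_correct, pred_region, gold_region, region_weight=0.5):
    """Standard answer reward augmented with a region-confidence signal:
    the trace is rewarded for grounding itself in the evidence region."""
    answer_reward = 1.0 if answer_correct else 0.0
    return answer_reward + region_weight * iou(pred_region, gold_region)

def grpo_advantages(rewards):
    """GRPO-style advantages: normalize each sampled trace's reward
    relative to the mean and std of its group (no learned critic)."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + 1e-6) for r in rewards]
```

Under this sketch, two sampled traces with the same final answer but different grounding quality receive different advantages, which is the mechanism the abstract credits with tightening the coupling between layout priors and VSC execution.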
Problem

Research questions and friction points this paper is trying to address.

document understanding
layout cognition
grounded reasoning
multimodal large language models
reasoning alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

layout cognition
grounded reasoning
Visual-Semantic Chain
multimodal LLMs
document understanding