🤖 AI Summary
Existing multimodal document large language models lack a structured reasoning mechanism that integrates layout awareness with fine-grained, evidence-grounded alignment, resulting in opaque and unreliable inference. This work proposes DocCogito, a novel framework that explicitly couples layout perception with region-grounded structured reasoning. DocCogito employs a lightweight layout tower to extract global layout priors and replaces free-form chain-of-thought with deterministic Visual-Semantic Chains (VSCs) to supervise intermediate reasoning steps aligned with evidence regions. The framework further enforces layout-reasoning consistency through layout-aware pretraining, VSC-guided cold-start initialization, rejection sampling, GRPO-based reinforcement learning, and a region-confidence reward mechanism. Evaluated on six benchmarks, including DocVQA, WTQ, and ChartQA, DocCogito demonstrates strong generalization and achieves state-of-the-art performance on four of them.
📝 Abstract
Document understanding with multimodal large language models (MLLMs) requires not only accurate answers but also explicit, evidence-grounded reasoning, especially in high-stakes scenarios. However, current document MLLMs still fall short of a complete, human-like reasoning process: even when they improve both layout encoding and chain-of-thought (CoT)-style prompting, the interaction between the two is typically learned implicitly and remains loosely coupled rather than being enforced as a systematic mechanism. We therefore propose DocCogito, a unified framework that integrates global layout perception with structured, region-grounded reasoning. DocCogito introduces a lightweight layout tower that distills page structure into learnable global layout prior tokens, and a deterministic Visual-Semantic Chain (VSC), a concise structured representation less ambiguous than free-form natural-language CoT, to supervise fine-grained intermediate reasoning aligned with evidence regions. Training follows a progressive recipe: layout-perception pretraining, VSC-guided cold start, rejection sampling, and GRPO-based reinforcement learning. To further strengthen the internal coupling between layout priors and VSC execution, we augment standard rewards with a fine-grained region-confidence signal that encourages reasoning traces to stay aligned with their corresponding evidence regions. Extensive experiments on six benchmarks (DocVQA, WTQ, ChartQA, TextVQA, OCRBench, and InfoVQA) demonstrate strong generalization, achieving state-of-the-art results on four of them.
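To make the augmented-reward idea concrete, the sketch below combines a standard answer-correctness reward with a region-confidence term that scores how well each reasoning step's predicted evidence box overlaps its gold box. This is a minimal illustration, not the paper's formulation: the IoU-based confidence measure, the per-step averaging, the `alpha` mixing weight, and all function names are assumptions introduced here for clarity.

```python
# Hedged sketch: mixing an answer reward with a region-confidence reward.
# IoU as the confidence measure and alpha=0.5 are illustrative choices,
# not the paper's actual reward definition.

def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def augmented_reward(answer_correct, pred_regions, gold_regions, alpha=0.5):
    """Blend answer correctness with per-step region alignment.

    answer_correct: whether the final answer matched the reference.
    pred_regions / gold_regions: evidence boxes, one per VSC step,
    assumed already aligned step-by-step.
    """
    answer_reward = 1.0 if answer_correct else 0.0
    if pred_regions and len(pred_regions) == len(gold_regions):
        # Average overlap across reasoning steps: traces whose cited
        # regions drift away from the gold evidence score lower.
        region_reward = sum(
            iou(p, g) for p, g in zip(pred_regions, gold_regions)
        ) / len(gold_regions)
    else:
        region_reward = 0.0
    return (1.0 - alpha) * answer_reward + alpha * region_reward
```

In a GRPO-style setup, a scalar like this would be computed per sampled trace and normalized within each group; the point of the region term is that a correct answer reached via unsupported regions earns less reward than one whose intermediate steps stay grounded in the right evidence.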