CogDoc: Towards Unified Thinking in Documents

📅 2025-12-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing document reasoning methods face an inherent trade-off between long-context modeling and fine-grained multimodal understanding. To address this, we propose a cognitively inspired coarse-to-fine two-stage reasoning framework that emulates human "rapid skimming followed by focused reasoning." Methodologically: (1) we design a dual-stage architecture for global context localization and local deep reasoning; (2) we employ direct reinforcement learning, rather than supervised fine-tuning, for policy initialization to avoid strategy misalignment; and (3) we integrate multimodal representations with long-context adaptive attention. Evaluated on visually rich document benchmarks, our 7B-parameter model achieves state-of-the-art performance, substantially outperforming larger closed-source models such as GPT-4o. To our knowledge, this is the first work to jointly optimize long-document comprehension and fine-grained multimodal reasoning at a small parameter scale.
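The two-stage flow described above can be sketched as a simple pipeline: a cheap low-resolution pass ranks pages to localize relevant content, then a costly high-resolution pass reasons only over the selected pages. This is a minimal illustration, not the paper's implementation; the function names, the keyword-overlap scorer, and the placeholder "reasoning" step are all assumptions for the sketch.

```python
# Hypothetical sketch of a coarse-to-fine document pipeline.
# Stage 1 ("Fast Reading"): skim every page cheaply and rank relevance.
# Stage 2 ("Focused Thinking"): reason only over the localized pages.

def fast_read(pages, query, top_k=2):
    """Stage 1: score each page by keyword overlap with the query
    (a stand-in for a low-resolution model pass) and return the
    indices of the top_k pages."""
    q_tokens = set(query.lower().split())
    scored = [(len(q_tokens & set(p.lower().split())), i)
              for i, p in enumerate(pages)]
    scored.sort(reverse=True)
    return [i for _, i in scored[:top_k]]

def focused_think(pages, indices):
    """Stage 2: placeholder for high-resolution reasoning; here it
    just concatenates the selected pages in document order."""
    return " ".join(pages[i] for i in sorted(indices))

pages = [
    "annual revenue table for 2023 fiscal year",
    "company history and founding story",
    "revenue grew 12 percent in 2023 per the table",
]
hits = fast_read(pages, "revenue 2023")
context = focused_think(pages, hits)
print(hits, "->", context)
```

The key property the sketch shows is that the expensive second stage never sees pages the first stage pruned, which is what makes the approach scale to long documents.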

📝 Abstract
Current document reasoning paradigms are constrained by a fundamental trade-off between scalability (processing long-context documents) and fidelity (capturing fine-grained, multimodal details). To bridge this gap, we propose CogDoc, a unified coarse-to-fine thinking framework that mimics human cognitive processes: a low-resolution "Fast Reading" phase for scalable information localization, followed by a high-resolution "Focused Thinking" phase for deep reasoning. We conduct a rigorous investigation into post-training strategies for the unified thinking framework, demonstrating that a Direct Reinforcement Learning (RL) approach outperforms RL with Supervised Fine-Tuning (SFT) initialization. Specifically, we find that direct RL avoids the "policy conflict" observed in SFT. Empirically, our 7B model achieves state-of-the-art performance within its parameter class, notably surpassing significantly larger proprietary models (e.g., GPT-4o) on challenging, visually rich document benchmarks.
Problem

Research questions and friction points this paper is trying to address.

Bridging scalability and fidelity in document reasoning
Proposing a unified coarse-to-fine thinking framework
Optimizing post-training strategies to avoid policy conflict
Innovation

Methods, ideas, or system contributions that make the work stand out.

Coarse-to-fine thinking framework mimics human cognition
Direct Reinforcement Learning outperforms SFT initialization
7B model surpasses larger models on visual document benchmarks
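To make the "direct RL, no SFT warm-start" idea concrete, here is a toy REINFORCE loop: a softmax policy over two hypothetical reading strategies is initialized uniformly (no supervised warm-start) and updated from task reward alone. The action names and reward values are invented for illustration and do not come from the paper.

```python
import math
import random

random.seed(0)

# Two hypothetical reading strategies with made-up task rewards.
ACTIONS = ["skim_then_focus", "read_everything"]
REWARD = {"skim_then_focus": 1.0, "read_everything": 0.2}

logits = [0.0, 0.0]  # uniform initial policy: direct RL, no SFT init
lr = 0.5

def softmax(z):
    m = max(z)
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]

for _ in range(200):
    probs = softmax(logits)
    a = random.choices(range(len(ACTIONS)), weights=probs)[0]
    r = REWARD[ACTIONS[a]]
    # REINFORCE update: grad of log pi(a) w.r.t. logits is
    # (indicator of a) - probs; scale by reward and learning rate.
    for i in range(len(logits)):
        logits[i] += lr * r * ((1.0 if i == a else 0.0) - probs[i])

probs = softmax(logits)
print(probs)  # probability mass concentrates on the higher-reward strategy
```

The point of the sketch is that the policy is shaped end-to-end by reward, so it never inherits a supervised behavior it later has to unlearn, which is the "policy conflict" the summary attributes to SFT initialization.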