🤖 AI Summary
Existing document reasoning methods face an inherent trade-off between long-context modeling and fine-grained multimodal understanding. To address this, we propose a cognitively inspired coarse-to-fine two-stage reasoning framework that emulates the human pattern of rapid skimming followed by focused reasoning. Methodologically: (1) we design a dual-stage architecture for global context localization and local deep reasoning; (2) we initialize the policy with direct reinforcement learning rather than supervised fine-tuning, avoiding the policy conflict that SFT initialization induces; and (3) we integrate multimodal representations with long-context adaptive attention. Evaluated on visually rich document benchmarks, our 7B-parameter model achieves state-of-the-art performance, substantially outperforming larger closed-source models such as GPT-4o. To our knowledge, this is the first work to jointly optimize long-document comprehension and fine-grained multimodal reasoning at this small parameter scale.
📝 Abstract
Current document reasoning paradigms are constrained by a fundamental trade-off between scalability (processing long-context documents) and fidelity (capturing fine-grained, multimodal details). To bridge this gap, we propose CogDoc, a unified coarse-to-fine thinking framework that mimics human cognitive processes: a low-resolution "Fast Reading" phase for scalable information localization, followed by a high-resolution "Focused Thinking" phase for deep reasoning. We conduct a rigorous investigation into post-training strategies for the unified thinking framework, demonstrating that a Direct Reinforcement Learning (RL) approach outperforms RL with Supervised Fine-Tuning (SFT) initialization. Specifically, we find that direct RL avoids the "policy conflict" observed with SFT. Empirically, our 7B model achieves state-of-the-art performance within its parameter class, notably surpassing significantly larger proprietary models (e.g., GPT-4o) on challenging, visually rich document benchmarks.
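The two-stage control flow described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `scorer` and `reasoner` callables, the page dictionary keys, and the `top_k` selection are all hypothetical stand-ins for the Fast Reading relevance model and the Focused Thinking reasoner.

```python
# Coarse-to-fine sketch: cheap low-resolution scoring first,
# expensive high-resolution reasoning only on the selected pages.
# All model interfaces here are assumed placeholders.

def fast_read(pages, query, scorer, top_k=2):
    """Stage 1 ("Fast Reading"): score low-res thumbnails, keep top-k pages."""
    ranked = sorted(pages, key=lambda p: scorer(p["thumbnail"], query), reverse=True)
    return ranked[:top_k]

def focused_think(pages, query, reasoner):
    """Stage 2 ("Focused Thinking"): reason over high-res views of kept pages."""
    return reasoner([p["full_res"] for p in pages], query)

def answer(document, query, scorer, reasoner, top_k=2):
    """Full coarse-to-fine pass: localize, then reason."""
    relevant = fast_read(document, query, scorer, top_k)
    return focused_think(relevant, query, reasoner)
```

The key design point is that the quadratic-cost reasoner never sees the full document; it only receives the few pages the cheap first stage localized, which is what makes long-context inputs tractable at fixed fidelity.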