🤖 AI Summary
Large language models are prone to "arithmetic hallucination" and "cognitive collapse" in high-complexity financial quantitative reasoning, limiting their reliability. To address this, this work introduces the Cognitive Complexity Benchmark (CCB), the first such benchmark built from Chinese A-share annual reports, and proposes the Iterative Dual-Phase Financial-PoT framework. By strictly decoupling semantic understanding from numerical computation and combining a neuro-symbolic self-correction mechanism with a strategy that separates variable extraction from logical formalization, the framework enables robust reasoning within a Python sandbox environment. Experimental results show that the approach improves average accuracy from 59.7% to 67.3% on Qwen3-235B, with performance gains of up to tenfold on the most complex tasks.
📝 Abstract
While Large Language Models excel at semantic tasks, they face a critical bottleneck in financial quantitative reasoning, frequently suffering from "Arithmetic Hallucinations" and a systemic failure mode we term "Cognitive Collapse". To strictly quantify this phenomenon, we introduce the Cognitive Complexity Benchmark (CCB), a robust evaluation framework grounded in a dataset constructed from 95 real-world Chinese A-share annual reports. Unlike traditional datasets, the CCB stratifies financial queries along a three-dimensional taxonomy (Data Source, Mapping Difficulty, and Result Unit), enabling the precise diagnosis of reasoning degradation in high-cognitive-load scenarios. To address these failures, we propose the Iterative Dual-Phase Financial-PoT framework. This neuro-symbolic architecture enforces a strict architectural decoupling: it first isolates semantic variable extraction and logic formulation, then offloads computation to an iterative, self-correcting Python sandbox to ensure deterministic execution. Evaluation on the CCB demonstrates that while standard Chain-of-Thought falters on complex tasks, our approach offers superior robustness, elevating the Qwen3-235B model's average accuracy from 59.7% to 67.3% and achieving gains of up to 10-fold in high-complexity reasoning tasks. These findings suggest that architectural decoupling is a critical enabling factor for improving reliability in financial reasoning, providing a transferable architectural insight for precision-critical domains that require tight alignment between semantic understanding and quantitative computation.
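The dual-phase decoupling the abstract describes can be sketched as follows. This is a minimal illustration of the general pattern, not the paper's implementation: the extraction and formulation steps are stubbed with placeholder functions (in the actual framework an LLM performs them), and the sandbox is approximated by `exec` over an isolated namespace with a simple retry loop standing in for the self-correction mechanism. All names and the toy report string are illustrative assumptions.

```python
import re

def extract_variables(report_text):
    """Phase 1a (semantic): pull named numeric variables from the report.
    Stubbed with a regex here; the framework would use an LLM for this."""
    return {name: float(value)
            for name, value in re.findall(r"(\w+)\s*=\s*([\d.]+)", report_text)}

def formulate_logic():
    """Phase 1b (semantic): express the target metric as symbolic Python code,
    keeping formula construction separate from variable extraction."""
    return "gross_margin = (revenue - cost) / revenue"

def execute_with_retry(code, variables, max_iters=3):
    """Phase 2 (symbolic): run the formula deterministically in an isolated
    namespace; on failure, retry (the full framework would feed the error
    back to the model to repair the code)."""
    for _ in range(max_iters):
        namespace = dict(variables)
        try:
            exec(code, {}, namespace)  # computation happens here, not in the LLM
            return namespace
        except Exception:
            continue  # self-correction loop would re-prompt with the traceback
    return None

report = "revenue = 120.0, cost = 90.0"
env = execute_with_retry(formulate_logic(), extract_variables(report))
print(env["gross_margin"])  # → 0.25
```

The key design point is that the language model never performs arithmetic itself: it only names variables and writes a formula, while a deterministic interpreter produces the number, which is what the abstract credits for avoiding arithmetic hallucinations.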