🤖 AI Summary
Large language models (LLMs) perform poorly on numerical question answering over financial documents that mix tables and text, and conventional critic agents rely on oracle labels and lack robustness. Method: This paper proposes a self-correcting multi-agent system comprising: (1) a robust, oracle-free critic mechanism that autonomously identifies numerical reasoning errors; (2) a collaborative calculator agent, decoupled from the LLM, that performs precise arithmetic computations; and (3) integrated techniques including programmatic chain-of-thought enhancement, dynamic self-correction, structured numerical parsing, and interactive reasoning-chain optimization. Results: On a financial document QA benchmark, the approach reduces the error rate by 37% relative to Program-of-Thought (PoT), reaches 89.2% accuracy on numerical answers, and markedly improves reasoning safety and resistance to hallucination.
📝 Abstract
Large language models (LLMs) have shown impressive capabilities on numerous natural language processing tasks, yet they still struggle with numerical question answering over financial documents that combine tabular and textual data. Recent work has shown that critic agents (i.e., self-correction) are effective for this task when oracle labels are available. Building on this framework, this paper examines how the traditional critic agent behaves when oracle labels are not available, and shows experimentally that its performance deteriorates in this scenario. Motivated by this finding, we present an improved critic agent together with a calculator agent; this combination outperforms the previous state-of-the-art approach (program-of-thought) and is safer. Furthermore, we investigate how our agents interact with each other and how this interaction affects their performance.
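To make the "calculator agent decoupled from the LLM" idea concrete, here is a minimal sketch of one plausible design: the LLM emits arithmetic as explicit tool calls in its reasoning text, and a separate, deterministic agent evaluates them with exact decimal arithmetic. The `CALC(...)` call syntax, function names, and tokenization scheme below are illustrative assumptions, not the paper's actual interface.

```python
import re
from decimal import Decimal, getcontext

getcontext().prec = 28  # exact-enough decimal arithmetic, avoiding float rounding

# Hypothetical tool-call syntax the LLM is prompted to emit, e.g. CALC(356.2 - 298.7)
CALC_PATTERN = re.compile(r"CALC\(([^)]+)\)")

def evaluate(expr: str) -> Decimal:
    """Evaluate a flat arithmetic expression (+, -, *, /) with Decimal precision."""
    # Tokenize into nonnegative numbers and binary operators; strip thousands separators.
    tokens = re.findall(r"\d+(?:\.\d+)?|[+\-*/]", expr.replace(",", ""))
    # Single left-to-right pass that respects precedence: * and / fold into the
    # top of the stack immediately; + and - defer their operand to a final sum.
    stack = [Decimal(tokens[0])]
    i = 1
    while i < len(tokens):
        op, num = tokens[i], Decimal(tokens[i + 1])
        if op == "*":
            stack[-1] *= num
        elif op == "/":
            stack[-1] /= num
        elif op == "+":
            stack.append(num)
        else:  # "-"
            stack.append(-num)
        i += 2
    return sum(stack, Decimal(0))

def run_calculator_agent(llm_output: str) -> str:
    """Replace each CALC(...) call in the LLM's reasoning with its exact result."""
    return CALC_PATTERN.sub(lambda m: str(evaluate(m.group(1))), llm_output)

reasoning = "Revenue grew by CALC(356.2 - 298.7) million year over year."
print(run_calculator_agent(reasoning))
# → Revenue grew by 57.5 million year over year.
```

Because the arithmetic never passes through the LLM, numerical errors the critic flags can be localized to expression *construction* rather than computation, which is the point of decoupling the two roles.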