🤖 AI Summary
Financial education question answering demands multi-step quantitative reasoning, domain-specific terminology comprehension, and realistic scenario modeling—capabilities poorly addressed by current large language models (LLMs). To bridge this gap, we propose a role-aware multi-agent framework that decomposes problem-solving into specialized roles (e.g., analyst, reviewer), integrating retrieval-augmented generation (RAG), role-specific prompting, and cross-model collaboration between generative and critical-review agents. This design significantly enhances multi-step reasoning fidelity and answer refinement for complex financial queries. Evaluated on a 3,532-question financial education benchmark, our method achieves a 6.6–8.3 percentage-point accuracy gain over zero-shot chain-of-thought baselines; notably, the GPT-4o-mini variant matches the performance of the domain-specialized FinGPT. Our core contribution is the first principled integration of role-based functional decomposition, critical multi-agent collaboration, and deep RAG—establishing a scalable paradigm for LLM deployment in specialized domains.
📝 Abstract
Question answering (QA) plays a central role in financial education, yet existing large language model (LLM) approaches often fail to capture the nuanced, specialized reasoning required for financial problem-solving. The financial domain demands multi-step quantitative reasoning, familiarity with domain-specific terminology, and comprehension of real-world scenarios. We present a multi-agent framework that leverages role-based prompting to enhance performance on domain-specific QA. Our framework comprises a Base Generator, an Evidence Retriever, and an Expert Reviewer agent that collaborate in a single refinement pass to produce a refined answer. We evaluate our framework on a set of 3,532 expert-designed finance education questions from Study.com, an online learning platform, using retrieval-augmented generation (RAG) over six finance textbooks for contextual evidence and prompting strategies that cast the reviewer as a domain expert. Our experiments indicate that critique-based refinement improves answer accuracy by 6.6–8.3% over zero-shot Chain-of-Thought baselines, with the highest performance from Gemini-2.0-Flash. Furthermore, our method enables GPT-4o-mini to achieve performance comparable to the finance-tuned FinGPT-mt_Llama3-8B_LoRA. Our results demonstrate a cost-effective approach to enhancing financial QA and offer insights for further research on multi-agent financial LLM systems.
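The single-pass, three-agent pipeline described in the abstract could be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the `FinanceQAPipeline` class, the prompt wording, and the `call`-style LLM interface are all assumptions, and any real system would plug in actual model clients and a textbook retrieval index.

```python
# Hypothetical sketch of the Base Generator -> Evidence Retriever -> Expert
# Reviewer pipeline. All names and prompts are illustrative assumptions.
from dataclasses import dataclass
from typing import Callable, List

# Any text-in/text-out model client (e.g., a wrapper around an LLM API).
LLM = Callable[[str], str]


@dataclass
class FinanceQAPipeline:
    generator: LLM                              # drafts an initial answer
    reviewer: LLM                               # critiques and refines the draft
    retrieve: Callable[[str, int], List[str]]   # RAG over finance textbooks

    def answer(self, question: str, k: int = 3) -> str:
        # 1. Evidence Retriever: fetch top-k passages from the textbook index.
        evidence = "\n".join(self.retrieve(question, k))
        # 2. Base Generator: produce a chain-of-thought draft grounded in evidence.
        draft = self.generator(
            f"Evidence:\n{evidence}\n\nQuestion: {question}\nThink step by step."
        )
        # 3. Expert Reviewer: a single critique-and-refine pass, no further loops.
        return self.reviewer(
            f"Question: {question}\nEvidence:\n{evidence}\n"
            f"Draft answer:\n{draft}\n"
            "As a finance domain expert, critique the draft and give a corrected "
            "final answer."
        )
```

A usage example with stub agents: `FinanceQAPipeline(generator=..., reviewer=..., retrieve=...).answer("What is NPV?")` runs retrieval once, drafts once, and reviews once, matching the single-pass design rather than an iterative debate loop.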