🤖 AI Summary
Earlier large language models (LLMs) have underperformed on high-stakes, domain-specific professional certification exams, most notably the Chartered Financial Analyst (CFA) exams, owing to weaknesses in rigorous financial reasoning and structured problem-solving.
Method: This study conducts a systematic evaluation of state-of-the-art reasoning models (Gemini 3.0 Pro, Gemini 2.5 Pro, GPT-5, Grok 4, Claude Opus 4.1, and DeepSeek-V3.1) on full-scale CFA Level I–III mock examinations comprising 980 expert-curated finance questions. To address reasoning limitations, we employ multi-step chain-of-thought prompting and structured answer generation.
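The paper does not include its prompting code, so the following is only a minimal sketch of what multi-step chain-of-thought prompting with structured answer generation might look like for a multiple-choice item. The prompt wording, function names, and the `query_model` stub are hypothetical, not the authors' implementation.

```python
import re

def query_model(prompt: str) -> str:
    # Hypothetical stub standing in for whichever LLM API was used.
    raise NotImplementedError("replace with a real model API call")

def build_cot_prompt(question: str, choices: dict[str, str]) -> str:
    """Multi-step chain-of-thought prompt: ask for explicit step-by-step
    reasoning, then a structured final line that can be parsed reliably."""
    options = "\n".join(f"{label}. {text}" for label, text in choices.items())
    return (
        "You are answering a CFA mock exam question. Work through it "
        "step by step, showing any financial calculations, then state "
        "your final answer on its own line as 'ANSWER: <letter>'.\n\n"
        f"Question: {question}\n{options}"
    )

def extract_answer(response: str) -> str | None:
    """Structured answer generation: recover the final 'ANSWER: X' choice."""
    match = re.search(r"ANSWER:\s*([A-C])", response)
    return match.group(1) if match else None
```

CFA multiple-choice questions offer three options (A–C), hence the narrow character class; Level III constructed-response questions would instead be graded free-form and fall outside this sketch.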
Contribution/Results: We provide the first empirical evidence that most evaluated models achieve passing scores across all three CFA levels. Notably, Gemini 3.0 Pro attains 97.6% accuracy on Level I, and GPT-5 and the Gemini models substantially exceed historical human average scores on Levels II and III. These findings show that next-generation reasoning models are capable of complex financial analysis, offering empirical grounding and a methodological framework for AI-augmented professional credentialing.
📝 Abstract
Previous research has reported that large language models (LLMs) demonstrate poor performance on the Chartered Financial Analyst (CFA) exams. However, recent reasoning models have achieved strong results on graduate-level academic and professional examinations across various disciplines. In this paper, we evaluate state-of-the-art reasoning models on a set of mock CFA exams consisting of 980 questions across three Level I exams, two Level II exams, and three Level III exams. Using the same pass/fail criteria as prior studies, we find that most models clear all three levels. The models that pass, ordered by overall performance, are Gemini 3.0 Pro, Gemini 2.5 Pro, GPT-5, Grok 4, Claude Opus 4.1, and DeepSeek-V3.1. Specifically, Gemini 3.0 Pro achieves a record score of 97.6% on Level I. Performance is also strong on Level II, led by GPT-5 at 94.3%. On Level III, Gemini 2.5 Pro attains the highest score with 86.4% on multiple-choice questions, while Gemini 3.0 Pro achieves 92.0% on constructed-response questions.
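The abstract reuses pass/fail criteria from earlier work without restating them. Purely for concreteness, the snippet below sketches the shape of such a decision rule; the 60% threshold and the 250-question exam size are illustrative assumptions, not figures taken from the paper or the prior studies it cites.

```python
def level_score(correct: list[bool]) -> float:
    """Fraction of questions answered correctly on one exam level."""
    return sum(correct) / len(correct)

def passes(correct: list[bool], threshold: float = 0.60) -> bool:
    """Assumed pass rule: clear a level if accuracy meets the threshold.
    The 0.60 cutoff is an illustrative stand-in for the prior studies'
    criteria, which this sketch does not reproduce exactly."""
    return level_score(correct) >= threshold

# Illustrative check: 244/250 = 97.6%, matching the reported Level I score.
level1 = [True] * 244 + [False] * 6
assert passes(level1)
```

Under any threshold in this vicinity, the reported scores clear the bar by a wide margin, which is why the ranking of models matters more than the exact cutoff.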