🤖 AI Summary
This work addresses the weak performance of large language models on subspecialty clinical reasoning in endocrinology, where guidelines change rapidly and evidence hierarchies are nuanced. The authors evaluate January Mirror, a system that combines a curated endocrinology and cardiometabolic evidence corpus with a structured reasoning architecture. Mirror operates under a closed-evidence constraint and produces evidence-linked answers without external retrieval. On a 120-question endocrinology board-style examination, Mirror achieved 87.5% accuracy (105/120), exceeding a human reference (62.3%) and frontier LLMs that had real-time web access: GPT-5.2 (74.6%), GPT-5 (74.0%), and Gemini-3-Pro (69.8%). On the 30 most difficult questions (human accuracy below 50%), it reached 76.7% accuracy. Across its outputs, 74.2% cited at least one guideline-tier source, and manual verification found 100% citation accuracy. The results indicate that curated evidence with explicit provenance can outperform unconstrained web retrieval for subspecialty clinical reasoning.
📝 Abstract
Background: Large language models have demonstrated strong performance on general medical examinations, but subspecialty clinical reasoning remains challenging due to rapidly evolving guidelines and nuanced evidence hierarchies. Methods: We evaluated January Mirror, an evidence-grounded clinical reasoning system, against frontier LLMs (GPT-5, GPT-5.2, Gemini-3-Pro) on a 120-question endocrinology board-style examination. Mirror integrates a curated endocrinology and cardiometabolic evidence corpus with a structured reasoning architecture to generate evidence-linked outputs. Mirror operated under a closed-evidence constraint without external retrieval. Comparator LLMs had real-time web access to guidelines and primary literature. Results: Mirror achieved 87.5% accuracy (105/120; 95% CI: 80.4-92.3%), exceeding a human reference of 62.3% and frontier LLMs including GPT-5.2 (74.6%), GPT-5 (74.0%), and Gemini-3-Pro (69.8%). On the 30 most difficult questions (human accuracy less than 50%), Mirror achieved 76.7% accuracy. Top-2 accuracy was 92.5% for Mirror versus 85.25% for GPT-5.2. Conclusions: Mirror provided evidence traceability: 74.2% of outputs cited at least one guideline-tier source, with 100% citation accuracy on manual verification. Curated evidence with explicit provenance can outperform unconstrained web retrieval for subspecialty clinical reasoning and supports auditability for clinical deployment.
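The abstract does not say how the 95% confidence interval was computed. As a plausibility check only, the sketch below assumes a Wilson score interval with z = 1.96; both the method and the helper name `wilson_interval` are assumptions for illustration, not the authors' stated procedure. Under that assumption, the reported 105/120 is consistent with the reported 80.4-92.3% range.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to an approximate 95% interval.
    The choice of the Wilson method is an assumption, not taken from the paper.
    """
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Counts reported in the abstract: 105 correct out of 120 questions.
correct, total = 105, 120
low, high = wilson_interval(correct, total)

print(f"accuracy = {correct / total:.1%}")    # 87.5%
print(f"95% CI   = {low:.1%} to {high:.1%}")  # 80.4% to 92.3%
```

Other interval methods (for example Wald or Clopper-Pearson) give slightly different bounds at this sample size, so agreement with the Wilson values is suggestive rather than conclusive.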