🤖 AI Summary
This study presents an exploratory evaluation of whether iterative self-reflection enables large language models to correct their own errors in medical question answering. We compare standard chain-of-thought prompting with multi-turn self-reflection prompting on three benchmarks (MedQA, HeadQA, and PubMedQA) using GPT-4o and GPT-4o-mini. Self-reflection does not consistently improve accuracy: it yields modest gains on MedQA, has limited or even detrimental effects on HeadQA and PubMedQA, and additional reflection rounds do not guarantee better performance. The findings highlight a gap between reasoning transparency and reasoning correctness, and provide empirical evidence on how far self-reflection can be relied upon in safety-critical healthcare applications.
📝 Abstract
Large language models (LLMs) have achieved strong performance on medical question answering (medical QA), and chain-of-thought (CoT) prompting has further improved results by eliciting explicit intermediate reasoning. Self-reflective (self-corrective) prompting, which asks an LLM to critique and revise its own reasoning, has been widely claimed to enhance model reliability, yet its effectiveness in safety-critical medical settings remains unclear. In this work, we conduct an exploratory analysis of self-reflective reasoning for medical multiple-choice question answering. Using GPT-4o and GPT-4o-mini, we compare standard CoT prompting with an iterative self-reflection loop and track how predictions evolve across reflection steps on three widely used medical QA benchmarks (MedQA, HeadQA, and PubMedQA). We analyze whether self-reflection leads to error correction, error persistence, or the introduction of new errors. Our results show that self-reflective prompting does not consistently improve accuracy and that its impact is highly dataset- and model-dependent: it yields modest gains on MedQA but has limited or even negative effects on HeadQA and PubMedQA, and increasing the number of reflection steps does not guarantee better performance. These findings highlight a gap between reasoning transparency and reasoning correctness, suggesting that self-reflective reasoning is better viewed as an analytical tool for understanding model behavior than as a standalone solution for improving medical QA reliability.
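The kind of analysis the abstract describes (an iterative reflection loop, with each step labelled as error correction, error persistence, or a newly introduced error) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function names `run_reflection`, `classify_transition`, and `summarize`, the prompt wording, and the `model` callable (any function mapping a prompt string to an option letter) are all hypothetical.

```python
from collections import Counter
from typing import Callable, List


def run_reflection(question: str, model: Callable[[str], str], steps: int) -> List[str]:
    """Return the answer from the initial CoT pass plus one answer per reflection step.

    `model` is a hypothetical stand-in for an LLM call; the prompts are illustrative.
    """
    answers = [model(f"{question}\nThink step by step, then give one option letter.")]
    for _ in range(steps):
        prompt = (
            f"{question}\nYour previous answer was {answers[-1]}. "
            "Critique your reasoning, then give one option letter."
        )
        answers.append(model(prompt))
    return answers


def classify_transition(before: str, after: str, gold: str) -> str:
    """Label how a single reflection step changed the prediction relative to the gold answer."""
    if before == gold and after == gold:
        return "stays_correct"
    if before != gold and after == gold:
        return "error_corrected"
    if before == gold and after != gold:
        return "error_introduced"
    return "error_persists"


def summarize(trajectories: List[List[str]], golds: List[str]) -> Counter:
    """Count transition labels over every consecutive pair of answers in every trajectory."""
    counts: Counter = Counter()
    for answers, gold in zip(trajectories, golds):
        for before, after in zip(answers, answers[1:]):
            counts[classify_transition(before, after, gold)] += 1
    return counts
```

Under this framing, reflection helps on net only when corrected errors outnumber introduced ones, which is why accuracy can stay flat or drop even when the model visibly revises its reasoning.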