🤖 AI Summary
This study identifies a critical mismatch in large language models (LLMs) applied to clinical pre-screening for rheumatoid arthritis (RA): high predictive accuracy coexists with low reasoning reliability. Method: We systematically evaluate LLMs using real-world patient data and expert blind review, and propose a multi-round LLM agent collaboration framework integrating clinical validation and domain-expert assessment. Contribution/Results: Empirical analysis reveals that while the best-performing model achieves 95% prediction accuracy, 68% of its reasoning traces contain substantive medical errors per blinded expert evaluation. This is the first study to empirically demonstrate that superficially high accuracy can mask profound clinical reasoning deficiencies—challenging prevailing assumptions about LLM interpretability and trustworthiness in healthcare. Our findings provide both a critical caution and a methodological foundation for the safe, reliable deployment of LLMs in high-stakes clinical decision-making.
📝 Abstract
Large language models (LLMs) offer a promising pre-screening tool, improving early disease detection and expanding healthcare access for underprivileged communities. Early diagnosis of many diseases remains a significant challenge in healthcare, primarily due to the nonspecific nature of early symptoms, the shortage of expert medical practitioners, and the need for prolonged clinical evaluations, all of which can delay treatment and adversely affect patient outcomes. With impressive predictive accuracy across a range of diseases, LLMs have the potential to revolutionize clinical pre-screening and decision-making for various medical conditions. In this work, we study the diagnostic capability of LLMs for rheumatoid arthritis (RA) using real-world patient data. Patient data were collected alongside diagnoses from medical experts, and LLM performance was evaluated against these expert diagnoses for RA prediction. We observe an interesting pattern in disease diagnosis and find an unexpected *misalignment between prediction and explanation*. We conduct a series of multi-round analyses using different LLM agents. The best-performing model correctly predicts RA approximately 95% of the time. However, when medical experts evaluated the reasoning generated by the model, they found that nearly 68% of it was incorrect. This study highlights a clear misalignment between LLMs' high prediction accuracy and their flawed reasoning, raising important questions about relying on LLM explanations in clinical settings. **LLMs provide incorrect reasoning to arrive at the correct answer for RA diagnosis.**
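The misalignment the abstract describes can be made concrete with two separate metrics: prediction accuracy against expert diagnoses, and the rate of expert-judged reasoning errors among the correct predictions. The sketch below is purely illustrative (the record structure, field names, and example values are assumptions, not the study's actual data or evaluation code):

```python
# Hypothetical sketch of the prediction/explanation misalignment metric.
# Each record: (model_prediction, expert_diagnosis, reasoning_ok), where
# reasoning_ok is a blinded expert's verdict on the model's reasoning
# trace (True = clinically sound). All values here are made up.

def misalignment_report(records):
    n = len(records)
    # Standard prediction accuracy against the expert diagnosis.
    accuracy = sum(pred == truth for pred, truth, _ in records) / n
    # Among correct predictions only: how often was the reasoning flawed?
    correct = [r for r in records if r[0] == r[1]]
    flawed_rate = sum(not ok for _, _, ok in correct) / len(correct)
    return accuracy, flawed_rate

records = [
    ("RA", "RA", False),        # right answer, flawed reasoning
    ("RA", "RA", True),         # right answer, sound reasoning
    ("not-RA", "not-RA", False),
    ("RA", "not-RA", True),     # wrong answer, excluded from flawed_rate
]
acc, flawed = misalignment_report(records)
print(f"accuracy={acc:.2f}, flawed reasoning among correct={flawed:.2f}")
```

Reporting the two numbers side by side is what surfaces the pattern: a model can score near-perfect accuracy while most of its correct answers rest on clinically unsound reasoning.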