Evaluating Prompt Engineering Techniques for Accuracy and Confidence Elicitation in Medical LLMs

📅 2025-05-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study systematically investigates the dual impact of prompt engineering on the accuracy and confidence calibration of five large language models (LLMs) in Persian-language medical examinations, evaluating 156 distinct prompt configurations. Methodologically, it combines temperature tuning, chain-of-thought (CoT) prompting, few-shot learning, affective prompting, and expert imitation, with multi-dimensional assessment via AUC-ROC, Brier Score, and Expected Calibration Error (ECE). Results reveal that CoT and affective prompts improve accuracy yet significantly exacerbate overconfidence; that closed-source models achieve higher accuracy but remain poorly calibrated; that Llama-3.1-8B consistently underperforms; and that all models exhibit a confidence–performance mismatch. The core contribution is a novel “accuracy–uncertainty co-optimization” paradigm for dual-objective prompt design, advancing trustworthy deployment of AI in clinical decision support.

📝 Abstract
This paper investigates how prompt engineering techniques impact both accuracy and confidence elicitation in Large Language Models (LLMs) applied to medical contexts. Using a stratified dataset of Persian board exam questions across multiple specialties, we evaluated five LLMs (GPT-4o, o3-mini, Llama-3.3-70b, Llama-3.1-8b, and DeepSeek-v3) across 156 configurations. These configurations varied in temperature settings (0.3, 0.7, 1.0), prompt styles (Chain-of-Thought, Few-Shot, Emotional, Expert Mimicry), and confidence scales (1–10, 1–100). We used AUC-ROC, Brier Score, and Expected Calibration Error (ECE) to evaluate alignment between confidence and actual performance. Chain-of-Thought prompts improved accuracy but also led to overconfidence, highlighting the need for calibration. Emotional prompting further inflated confidence, risking poor decisions. Smaller models like Llama-3.1-8b underperformed across all metrics, while proprietary models showed higher accuracy but still lacked calibrated confidence. These results suggest prompt engineering must address both accuracy and uncertainty to be effective in high-stakes medical tasks.
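The two calibration metrics named in the abstract can be sketched in a few lines. This is a minimal illustration with synthetic data, not the authors' evaluation code; confidences are assumed to be rescaled to [0, 1] before scoring.

```python
import numpy as np

def brier_score(conf, correct):
    # Mean squared gap between the stated confidence and the 0/1 outcome.
    return np.mean((conf - correct) ** 2)

def expected_calibration_error(conf, correct, n_bins=10):
    # Bin predictions by confidence, then average |accuracy - mean confidence|
    # per bin, weighted by the fraction of samples that fall in the bin.
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return ece

# Synthetic example of the overconfidence pattern the paper reports:
# high stated confidence, lower actual accuracy.
conf = np.array([0.90, 0.95, 0.90, 0.85, 0.95, 0.90])
correct = np.array([1, 0, 1, 0, 1, 1])
```

A perfectly calibrated model drives both scores toward zero; the overconfident pattern above leaves a visible gap between mean confidence and accuracy, which ECE captures directly.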
Problem

Research questions and friction points this paper is trying to address.

Evaluating prompt engineering's impact on medical LLM accuracy and confidence
Assessing calibration between confidence and performance in medical LLMs
Identifying optimal prompt styles for reliable medical decision-making
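The prompt styles and confidence scales listed above can be combined mechanically, which is how a 156-configuration grid arises. The builder below is purely illustrative: the wording of each prefix and the function name are assumptions, not the paper's actual prompts.

```python
# Hypothetical prompt builder for the styles under evaluation
# (CoT, expert mimicry, emotional) and the 1-10 / 1-100 confidence scales.
def build_prompt(question, style="cot", scale=10):
    prefixes = {
        "cot": "Think step by step before answering.",
        "expert": "You are a board-certified specialist physician.",
        "emotional": "This answer is critically important to a patient's care.",
    }
    return (
        f"{prefixes[style]}\n\n{question}\n\n"
        f"Give your final answer, then rate your confidence on a 1-{scale} scale."
    )
```

Crossing the style, scale, and temperature axes (plus few-shot variants) yields the kind of configuration grid the study sweeps.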
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluated multiple prompt styles for medical LLMs
Assessed confidence calibration using AUC-ROC and ECE
Found Chain-of-Thought improves accuracy but increases overconfidence