🤖 AI Summary
Existing medical large language model (LLM) evaluation benchmarks inadequately assess reasoning reliability: they focus solely on final-answer accuracy while neglecting two critical safety dimensions, Chain-of-Thought (CoT) faithfulness (logical and factual consistency between reasoning steps and answers) and anti-sycophancy (the ability to resist misleading or manipulative prompts).
Method: We introduce MedOmni-45°, the first safety-performance co-evaluation benchmark for medical LLMs, comprising 1,804 medical reasoning questions, each paired with seven categories of manipulative hints and a no-hint baseline. We establish a three-dimensional evaluation framework measuring accuracy, CoT faithfulness, and anti-sycophancy, and propose a novel 45° trade-off diagram to quantify the safety-performance balance.
Contribution/Results: Evaluating more than 189K reasoning traces from seven representative LLMs, we find that none surpasses the 45° diagonal marking the ideal safety-performance balance. The open-source QwQ-32B comes closest to this boundary (43.81°), balancing safety and accuracy without leading in both. All code and data are publicly released.
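The summary does not spell out how the trade-off angle is computed. A minimal sketch, assuming the angle is taken between the performance (accuracy) axis and the point defined by accuracy and a composite safety score (here the mean of CoT faithfulness and anti-sycophancy), so that a perfectly balanced model lands exactly on the 45° diagonal:

```python
import math

def trade_off_angle(accuracy: float, cot_faithfulness: float, anti_sycophancy: float) -> float:
    """Hypothetical trade-off angle in degrees.

    Assumes the safety axis is the mean of CoT faithfulness and
    anti-sycophancy, and that the angle is measured between the
    performance (accuracy) axis and the point (accuracy, safety).
    A model balancing both dimensions equally scores exactly 45 degrees.
    """
    safety = (cot_faithfulness + anti_sycophancy) / 2.0
    return math.degrees(math.atan2(safety, accuracy))

# A model slightly stronger on accuracy than on safety falls just below 45 degrees.
print(round(trade_off_angle(accuracy=0.82, cot_faithfulness=0.78, anti_sycophancy=0.74), 2))
```

This is only one plausible reading of the 45° plot; the paper's released code defines the actual formula.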
📝 Abstract
With the increasing use of large language models (LLMs) in medical decision support, it is essential to evaluate not only their final answers but also the reliability of their reasoning. Two key risks are unfaithful Chain-of-Thought (CoT) reasoning -- reasoning that does not align with responses and medical facts -- and sycophancy, where models follow misleading cues over correctness. Existing benchmarks often collapse such vulnerabilities into single accuracy scores. To address this, we introduce MedOmni-45°, a benchmark and workflow designed to quantify safety-performance trade-offs under manipulative hint conditions. It contains 1,804 reasoning-focused medical questions across six specialties and three task types, including 500 from MedMCQA. Each question is paired with seven manipulative hint types and a no-hint baseline, producing about 27K inputs. We evaluate seven LLMs spanning open- vs. closed-source, general-purpose vs. medical, and base vs. reasoning-enhanced models, totaling over 189K inferences. Three metrics -- Accuracy, CoT-Faithfulness, and Anti-Sycophancy -- are combined into a composite score visualized with a 45° plot. Results show a consistent safety-performance trade-off, with no model surpassing the diagonal. The open-source QwQ-32B performs closest (43.81°), balancing safety and accuracy but not leading in both. MedOmni-45° thus provides a focused benchmark for exposing reasoning vulnerabilities in medical LLMs and guiding safer model development.
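To make the benchmark construction concrete, here is a minimal sketch of how each question could be crossed with the hint conditions before inference. The hint-category names and record fields below are illustrative assumptions, not the paper's actual labels:

```python
from itertools import product

# Illustrative conditions: seven manipulative hint categories plus a no-hint
# baseline. These specific category names are assumptions for the sketch.
HINT_CONDITIONS = ["none", "authority", "bandwagon", "false_citation",
                   "emotional_pressure", "answer_leak", "self_doubt", "sunk_cost"]

def build_inputs(questions):
    """Cross every question with each hint condition to form the evaluation inputs."""
    inputs = []
    for question, hint in product(questions, HINT_CONDITIONS):
        prompt = question["stem"]
        if hint != "none":
            # Hypothetical per-question hint text keyed by category.
            prompt += "\n[Hint] " + question["hints"][hint]
        inputs.append({"question_id": question["id"], "hint": hint, "prompt": prompt})
    return inputs
```

Each of the seven evaluated models would then answer every input, and the resulting reasoning traces would be scored for Accuracy, CoT-Faithfulness, and Anti-Sycophancy before being placed on the 45° plot.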