MedOmni-45°: A Safety-Performance Benchmark for Reasoning-Oriented LLMs in Medicine

📅 2025-08-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing medical large language model (LLM) evaluation benchmarks inadequately assess reasoning reliability, focusing solely on final answer accuracy while neglecting two critical safety dimensions: Chain-of-Thought (CoT) faithfulness, i.e., logical and factual consistency between reasoning steps and answers, and anti-sycophancy, the ability to resist misleading or manipulative prompts. Method: We introduce the first safety-performance co-evaluation benchmark for medical LLMs, comprising 1,804 medical reasoning questions and seven categories of adversarial prompts. We establish a three-dimensional evaluation framework measuring accuracy, CoT faithfulness, and anti-sycophancy, and propose a novel 45° trade-off diagram to quantify the safety-performance balance. Contribution/Results: Evaluating 189K+ reasoning traces across seven representative LLM families, we find none surpass the ideal safety-performance boundary. Among open-weight models, QwQ-32B achieves the best trade-off (43.81°). All code and data are publicly released.

📝 Abstract
With the increasing use of large language models (LLMs) in medical decision support, it is essential to evaluate not only their final answers but also the reliability of their reasoning. Two key risks are Chain-of-Thought (CoT) faithfulness (whether reasoning aligns with responses and medical facts) and sycophancy, where models follow misleading cues over correctness. Existing benchmarks often collapse such vulnerabilities into single accuracy scores. To address this, we introduce MedOmni-45°, a benchmark and workflow designed to quantify safety-performance trade-offs under manipulative hint conditions. It contains 1,804 reasoning-focused medical questions across six specialties and three task types, including 500 from MedMCQA. Each question is paired with seven manipulative hint types and a no-hint baseline, producing about 27K inputs. We evaluate seven LLMs spanning open- vs. closed-source, general-purpose vs. medical, and base vs. reasoning-enhanced models, totaling over 189K inferences. Three metrics (Accuracy, CoT-Faithfulness, and Anti-Sycophancy) are combined into a composite score visualized with a 45° plot. Results show a consistent safety-performance trade-off, with no model surpassing the diagonal. The open-source QwQ-32B performs closest (43.81°), balancing safety and accuracy but not leading in both. MedOmni-45° thus provides a focused benchmark for exposing reasoning vulnerabilities in medical LLMs and guiding safer model development.
Problem

Research questions and friction points this paper is trying to address.

Evaluating reasoning reliability in medical LLMs beyond accuracy scores
Assessing Chain-of-Thought faithfulness and anti-sycophancy vulnerabilities
Quantifying safety-performance trade-offs under manipulative hint conditions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmark workflow pairing each question with manipulative hints
Composite score combining three key metrics
45° plot visualizing safety-performance trade-offs
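The 45° plot places a model's performance and safety on the two axes, with the diagonal marking an ideal balance. The paper does not specify the exact formula, so the sketch below is a hypothetical illustration: it assumes performance is accuracy, safety is the mean of CoT-faithfulness and anti-sycophancy, and the reported angle (e.g., QwQ-32B's 43.81°) is the angle of the (performance, safety) point from the x-axis. The function name and aggregation are assumptions, not the authors' method.

```python
import math

def tradeoff_angle(accuracy: float, cot_faithfulness: float, anti_sycophancy: float) -> float:
    """Angle (in degrees) of the (performance, safety) point above the x-axis.

    Hypothetical reconstruction: safety is taken as the mean of the two
    safety metrics; exactly 45 degrees would mean safety and performance
    are in perfect balance.
    """
    safety = (cot_faithfulness + anti_sycophancy) / 2
    return math.degrees(math.atan2(safety, accuracy))

# A model slightly stronger on accuracy than on safety lands below the
# 45-degree diagonal (illustrative numbers only).
angle = tradeoff_angle(accuracy=0.80, cot_faithfulness=0.75, anti_sycophancy=0.72)
```

Under this reading, "no model surpassing the diagonal" means no model's point sits at or above 45°, i.e., safety never matches or exceeds performance across the evaluated families.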
Kaiyuan Ji
Shanghai AI Lab, East China Normal University
deep learning, AI for Medicine
Yijin Guo
Shanghai AI Laboratory
Zicheng Zhang
Shanghai AI Laboratory
Xiangyang Zhu
Shanghai AI Laboratory
Yuan Tian
Shanghai AI Laboratory
Ning Liu
School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University
Guangtao Zhai
Professor, IEEE Fellow, Shanghai Jiao Tong University
Multimedia Signal Processing, Visual Quality Assessment, QoE, AI Evaluation, Displays