🤖 AI Summary
Existing medical large language model (LLM) evaluation benchmarks inadequately assess reasoning reliability: they focus solely on final-answer accuracy while neglecting two critical safety dimensions, Chain-of-Thought (CoT) faithfulness (logical and factual consistency between reasoning steps and answers) and anti-sycophancy (the ability to resist misleading or manipulative prompts).
Method: We introduce MedOmni-45°, the first safety-performance co-evaluation benchmark for medical LLMs, comprising 1,804 medical reasoning questions, each paired with seven categories of manipulative hints and a no-hint baseline. We establish a three-dimensional evaluation framework measuring accuracy, CoT faithfulness, and anti-sycophancy, and propose a novel 45° trade-off diagram to quantify the safety-performance balance.
Contribution/Results: Evaluating more than 189K reasoning traces from seven representative LLMs, we find that none surpasses the 45° diagonal marking the ideal safety-performance balance. The open-source QwQ-32B comes closest to this boundary (43.81°), balancing safety and accuracy without leading in both. All code and data are publicly released.
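The summary does not spell out how the trade-off angle is computed. A minimal sketch, assuming the angle is taken between the performance (accuracy) axis and the point defined by accuracy and a composite safety score (here the mean of CoT faithfulness and anti-sycophancy), so that a perfectly balanced model lands exactly on the 45° diagonal:

```python
import math

def trade_off_angle(accuracy: float, cot_faithfulness: float, anti_sycophancy: float) -> float:
    """Hypothetical trade-off angle in degrees.

    Assumes the safety axis is the mean of CoT faithfulness and
    anti-sycophancy, and that the angle is measured between the
    performance (accuracy) axis and the point (accuracy, safety).
    A model balancing both dimensions equally scores exactly 45 degrees.
    """
    safety = (cot_faithfulness + anti_sycophancy) / 2.0
    return math.degrees(math.atan2(safety, accuracy))

# A model slightly stronger on accuracy than on safety falls just below 45 degrees.
print(round(trade_off_angle(accuracy=0.82, cot_faithfulness=0.78, anti_sycophancy=0.74), 2))
```

This is only one plausible reading of the 45° plot; the paper's released code defines the actual formula.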
📝 Abstract
With the increasing use of large language models (LLMs) in medical decision support, it is essential to evaluate not only their final answers but also the reliability of their reasoning. Two key risks are unfaithful Chain-of-Thought (CoT) reasoning -- reasoning that does not align with responses and medical facts -- and sycophancy, where models follow misleading cues over correctness. Existing benchmarks often collapse such vulnerabilities into single accuracy scores. To address this, we introduce MedOmni-45°, a benchmark and workflow designed to quantify safety-performance trade-offs under manipulative hint conditions. It contains 1,804 reasoning-focused medical questions across six specialties and three task types, including 500 from MedMCQA. Each question is paired with seven manipulative hint types and a no-hint baseline, producing about 27K inputs. We evaluate seven LLMs spanning open- vs. closed-source, general-purpose vs. medical, and base vs. reasoning-enhanced models, totaling over 189K inferences. Three metrics -- Accuracy, CoT-Faithfulness, and Anti-Sycophancy -- are combined into a composite score visualized with a 45° plot. Results show a consistent safety-performance trade-off, with no model surpassing the diagonal. The open-source QwQ-32B performs closest (43.81°), balancing safety and accuracy but not leading in both. MedOmni-45° thus provides a focused benchmark for exposing reasoning vulnerabilities in medical LLMs and guiding safer model development.
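To make the benchmark construction concrete, here is a minimal sketch of how each question could be crossed with the hint conditions before inference. The hint-category names and record fields below are illustrative assumptions, not the paper's actual labels:

```python
from itertools import product

# Illustrative conditions: seven manipulative hint categories plus a no-hint
# baseline. These specific category names are assumptions for the sketch.
HINT_CONDITIONS = ["none", "authority", "bandwagon", "false_citation",
                   "emotional_pressure", "answer_leak", "self_doubt", "sunk_cost"]

def build_inputs(questions):
    """Cross every question with each hint condition to form the evaluation inputs."""
    inputs = []
    for question, hint in product(questions, HINT_CONDITIONS):
        prompt = question["stem"]
        if hint != "none":
            # Hypothetical per-question hint text keyed by category.
            prompt += "\n[Hint] " + question["hints"][hint]
        inputs.append({"question_id": question["id"], "hint": hint, "prompt": prompt})
    return inputs
```

Each of the seven evaluated models would then answer every input, and the resulting reasoning traces would be scored for Accuracy, CoT-Faithfulness, and Anti-Sycophancy before being placed on the 45° plot.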