🤖 AI Summary
Existing medical benchmarks over-rely on classification accuracy and fail to expose critical deficiencies of vision-language models (VLMs) in high-stakes clinical reasoning. To address this, we introduce Neural-MedBench, the first compact, deep-evaluation benchmark dedicated to multimodal neurological clinical reasoning. It integrates multi-sequence MRI, electronic health records, and clinical notes, covering three core tasks: differential diagnosis, lesion recognition, and rationale generation. We propose a Two-Axis Evaluation Framework that assesses both breadth of generalization and reasoning fidelity, and we design a hybrid evaluation pipeline combining LLM-based automated scoring, clinician validation, and embedding-based semantic similarity metrics. Experiments on state-of-the-art models, including GPT-4o, Claude-4, and MedGemma, reveal substantially lower performance on Neural-MedBench than on conventional benchmarks, and error analysis confirms that flawed clinical reasoning, not superficial misclassification, is the primary failure mode. These results underscore both the necessity of deep reasoning assessment and the effectiveness of our benchmark.
📝 Abstract
Recent advances in vision-language models (VLMs) have achieved remarkable performance on standard medical benchmarks, yet their true clinical reasoning ability remains unclear. Existing datasets predominantly emphasize classification accuracy, creating an evaluation illusion in which models appear proficient while still failing at high-stakes diagnostic reasoning. We introduce Neural-MedBench, a compact yet reasoning-intensive benchmark specifically designed to probe the limits of multimodal clinical reasoning in neurology. Neural-MedBench integrates multi-sequence MRI scans, structured electronic health records, and clinical notes, and encompasses three core task families: differential diagnosis, lesion recognition, and rationale generation. To ensure reliable evaluation, we develop a hybrid scoring pipeline that combines LLM-based graders, clinician validation, and semantic similarity metrics. In a systematic evaluation of state-of-the-art VLMs, including GPT-4o, Claude-4, and MedGemma, we observe a sharp performance drop relative to conventional datasets. Error analysis shows that reasoning failures, rather than perceptual errors, dominate model shortcomings. Our findings highlight the necessity of a Two-Axis Evaluation Framework: breadth-oriented large datasets for statistical generalization, and depth-oriented compact benchmarks such as Neural-MedBench for reasoning fidelity. We release Neural-MedBench at https://neuromedbench.github.io/ as an open and extensible diagnostic testbed to guide the expansion of future benchmarks and enable rigorous yet cost-effective assessment of clinically trustworthy AI.
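
To make the hybrid scoring idea concrete, the sketch below blends an embedding-based semantic similarity score with an externally supplied LLM-grader score. It is a minimal illustration under stated assumptions, not the actual Neural-MedBench pipeline: the encoder name, the weighting, and the example texts are placeholders, and the clinician-validation step is only noted in a comment.

```python
# Minimal sketch of a hybrid scoring step: embedding-based semantic similarity
# combined with an LLM-grader score. The encoder, weights, and example texts
# are illustrative assumptions, not the benchmark's actual configuration.
from sentence_transformers import SentenceTransformer, util


def semantic_similarity(prediction: str, reference: str,
                        model_name: str = "all-MiniLM-L6-v2") -> float:
    """Cosine similarity between sentence embeddings, mapped to [0, 1]."""
    encoder = SentenceTransformer(model_name)
    emb = encoder.encode([prediction, reference], convert_to_tensor=True)
    cos = util.cos_sim(emb[0], emb[1]).item()
    return (cos + 1.0) / 2.0


def hybrid_score(prediction: str, reference: str,
                 llm_grader_score: float, w_sim: float = 0.5) -> float:
    """Weighted blend of semantic similarity and an LLM-grader score,
    both assumed to lie in [0, 1]. In a full pipeline, clinician review
    would adjudicate cases where the two signals disagree."""
    sim = semantic_similarity(prediction, reference)
    return w_sim * sim + (1.0 - w_sim) * llm_grader_score


if __name__ == "__main__":
    pred = "Findings suggest multiple sclerosis given periventricular lesions."
    ref = "Periventricular white-matter lesions consistent with multiple sclerosis."
    # llm_grader_score would come from a separate rubric-based grading call.
    print(f"hybrid score: {hybrid_score(pred, ref, llm_grader_score=0.8):.3f}")
```

A blended score of this form lets the automated pipeline flag low-agreement answers for clinician validation rather than replacing expert judgment.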