Beyond Classification Accuracy: Neural-MedBench and the Need for Deeper Reasoning Benchmarks

📅 2025-09-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing medical benchmarks over-rely on classification accuracy, masking critical deficiencies of vision-language models (VLMs) in high-stakes clinical reasoning. To address this, the paper introduces Neural-MedBench, a compact, reasoning-intensive benchmark for multimodal neurological clinical reasoning. It integrates multi-sequence MRI, structured electronic health records, and clinical notes, and covers three core task families: differential diagnosis, lesion recognition, and rationale generation. The authors propose a Two-Axis Evaluation Framework that pairs breadth-oriented generalization testing with depth-oriented assessment of reasoning fidelity, and design a hybrid scoring pipeline combining LLM-based automated grading, clinician validation, and semantic similarity metrics. Experiments on state-of-the-art models, including GPT-4o, Claude-4, and MedGemma, reveal substantially lower performance on Neural-MedBench than on conventional benchmarks; error analysis confirms that flawed clinical reasoning, not perceptual misclassification, is the dominant failure mode. These findings support both the necessity of deep reasoning assessment and the utility of the benchmark.

📝 Abstract
Recent advances in vision-language models (VLMs) have achieved remarkable performance on standard medical benchmarks, yet their true clinical reasoning ability remains unclear. Existing datasets predominantly emphasize classification accuracy, creating an evaluation illusion in which models appear proficient while still failing at high-stakes diagnostic reasoning. We introduce Neural-MedBench, a compact yet reasoning-intensive benchmark specifically designed to probe the limits of multimodal clinical reasoning in neurology. Neural-MedBench integrates multi-sequence MRI scans, structured electronic health records, and clinical notes, and encompasses three core task families: differential diagnosis, lesion recognition, and rationale generation. To ensure reliable evaluation, we develop a hybrid scoring pipeline that combines LLM-based graders, clinician validation, and semantic similarity metrics. Through systematic evaluation of state-of-the-art VLMs, including GPT-4o, Claude-4, and MedGemma, we observe a sharp performance drop compared to conventional datasets. Error analysis shows that reasoning failures, rather than perceptual errors, dominate model shortcomings. Our findings highlight the necessity of a Two-Axis Evaluation Framework: breadth-oriented large datasets for statistical generalization, and depth-oriented, compact benchmarks such as Neural-MedBench for reasoning fidelity. We release Neural-MedBench at https://neuromedbench.github.io/ as an open and extensible diagnostic testbed, which guides the expansion of future benchmarks and enables rigorous yet cost-effective assessment of clinically trustworthy AI.
Problem

Research questions and friction points this paper is trying to address.

Assessing true clinical reasoning beyond classification accuracy in medical AI
Evaluating multimodal diagnostic reasoning in neurology using specialized benchmarks
Identifying reasoning failures versus perceptual errors in vision-language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Neural-MedBench integrates MRI scans and EHR data
Hybrid scoring combines LLM graders and clinician validation
Two-Axis Framework balances breadth and depth evaluation
Miao Jing
Guangdong Institute of Intelligence Science and Technology, Hengqin, Zhuhai, Guangdong, China
Mengting Jia
Guangdong Institute of Intelligence Science and Technology, Hengqin, Zhuhai, Guangdong, China
Junling Lin
Beijing Chaoyang Hospital, Capital Medical University, Beijing, China
Zhongxia Shen
Sleep Medical Center of Huzhou Third Municipal Hospital, the Affiliated Hospital of Wenzhou Medical University, Huzhou, China
Lijun Wang
Zhejiang University
Statistical Learning, Bioinformatics, Astrophysics
Yuanyuan Peng
Soochow University
Huan Gao
Microsoft China
Natural Language Processing
Mingkun Xu
Tsinghua University
Brain-inspired Computing, Spiking Neural Networks, LLM/VLM, AI4Science/Health, Continual Learning
Shangyang Li
Peking University
Computational Neuroscience, Machine Learning