🤖 AI Summary
Existing medical benchmarks over-rely on classification accuracy and fail to expose critical deficiencies of vision-language models (VLMs) in high-stakes clinical reasoning. To address this, we introduce Neural-MedBench, the first compact, deep-evaluation benchmark dedicated to multimodal neurological clinical reasoning. It integrates multi-sequence MRI, electronic health records, and clinical notes, covering three core tasks: differential diagnosis, lesion recognition, and rationale generation. We propose a Two-Axis Evaluation Framework that assesses both breadth of generalization and reasoning fidelity, and we design a hybrid evaluation pipeline combining LLM-based automated scoring, clinician validation, and embedding-based semantic similarity metrics. Experiments on state-of-the-art models, including GPT-4o, Claude-4, and MedGemma, reveal substantially lower performance on Neural-MedBench than on conventional benchmarks, and error analysis confirms that flawed clinical reasoning, not superficial misclassification, is the primary failure mode. These results underscore both the necessity of deep reasoning assessment and the effectiveness of our benchmark.
📝 Abstract
Recent advances in vision-language models (VLMs) have achieved remarkable performance on standard medical benchmarks, yet their true clinical reasoning ability remains unclear. Existing datasets predominantly emphasize classification accuracy, creating an evaluation illusion in which models appear proficient while still failing at high-stakes diagnostic reasoning. We introduce Neural-MedBench, a compact yet reasoning-intensive benchmark specifically designed to probe the limits of multimodal clinical reasoning in neurology. Neural-MedBench integrates multi-sequence MRI scans, structured electronic health records, and clinical notes, and encompasses three core task families: differential diagnosis, lesion recognition, and rationale generation. To ensure reliable evaluation, we develop a hybrid scoring pipeline that combines LLM-based graders, clinician validation, and semantic similarity metrics. In a systematic evaluation of state-of-the-art VLMs, including GPT-4o, Claude-4, and MedGemma, we observe a sharp performance drop relative to conventional datasets. Error analysis shows that reasoning failures, rather than perceptual errors, dominate model shortcomings. Our findings highlight the necessity of a Two-Axis Evaluation Framework: breadth-oriented large datasets for statistical generalization, and depth-oriented compact benchmarks such as Neural-MedBench for reasoning fidelity. We release Neural-MedBench at https://neuromedbench.github.io/ as an open and extensible diagnostic testbed to guide the expansion of future benchmarks and enable rigorous yet cost-effective assessment of clinically trustworthy AI.
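
To make the hybrid scoring idea concrete, the sketch below blends an embedding-based semantic similarity score with an externally supplied LLM-grader score. It is a minimal illustration under stated assumptions, not the actual Neural-MedBench pipeline: the encoder name, the weighting, and the example texts are placeholders, and the clinician-validation step is only noted in a comment.

```python
# Minimal sketch of a hybrid scoring step: embedding-based semantic similarity
# combined with an LLM-grader score. The encoder, weights, and example texts
# are illustrative assumptions, not the benchmark's actual configuration.
from sentence_transformers import SentenceTransformer, util


def semantic_similarity(prediction: str, reference: str,
                        model_name: str = "all-MiniLM-L6-v2") -> float:
    """Cosine similarity between sentence embeddings, mapped to [0, 1]."""
    encoder = SentenceTransformer(model_name)
    emb = encoder.encode([prediction, reference], convert_to_tensor=True)
    cos = util.cos_sim(emb[0], emb[1]).item()
    return (cos + 1.0) / 2.0


def hybrid_score(prediction: str, reference: str,
                 llm_grader_score: float, w_sim: float = 0.5) -> float:
    """Weighted blend of semantic similarity and an LLM-grader score,
    both assumed to lie in [0, 1]. In a full pipeline, clinician review
    would adjudicate cases where the two signals disagree."""
    sim = semantic_similarity(prediction, reference)
    return w_sim * sim + (1.0 - w_sim) * llm_grader_score


if __name__ == "__main__":
    pred = "Findings suggest multiple sclerosis given periventricular lesions."
    ref = "Periventricular white-matter lesions consistent with multiple sclerosis."
    # llm_grader_score would come from a separate rubric-based grading call.
    print(f"hybrid score: {hybrid_score(pred, ref, llm_grader_score=0.8):.3f}")
```

A blended score of this form lets the automated pipeline flag low-agreement answers for clinician validation rather than replacing expert judgment.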