SpineBench: A Clinically Salient, Level-Aware Benchmark Powered by the SpineMed-450k Corpus

📅 2025-10-03
🤖 AI Summary
Spinal disorders affect 619 million people globally, yet AI-assisted diagnosis remains hindered by the scarcity of vertebral-level, multimodal (X-ray/CT/MRI) clinical data and of standardized evaluation benchmarks. To address this, we introduce SpineMed-450k, the first large-scale, spine-specific instruction dataset, with over 450,000 instruction instances designed for vertebral-level reasoning, and SpineBench, a clinically oriented evaluation benchmark. The data is curated with a clinician-in-the-loop pipeline that uses a two-stage LLM generation method (draft, then revision) and draws on textbooks, guidelines, open datasets, and about 1,000 de-identified hospital cases, which keeps the data traceable. On SpineBench, several recent large vision-language models show systematic weaknesses in fine-grained, level-specific reasoning, whereas our model fine-tuned on SpineMed-450k shows consistent and significant improvements across all tasks. Clinician assessments confirm the diagnostic clarity and practical utility of the fine-tuned model's outputs.

📝 Abstract
Spine disorders affect 619 million people globally and are a leading cause of disability, yet AI-assisted diagnosis remains limited by the lack of level-aware, multimodal datasets. Clinical decision-making for spine disorders requires sophisticated reasoning across X-ray, CT, and MRI at specific vertebral levels. However, progress has been constrained by the absence of traceable, clinically-grounded instruction data and standardized, spine-specific benchmarks. To address this, we introduce SpineMed, an ecosystem co-designed with practicing spine surgeons. It features SpineMed-450k, the first large-scale dataset explicitly designed for vertebral-level reasoning across imaging modalities with over 450,000 instruction instances, and SpineBench, a clinically-grounded evaluation framework. SpineMed-450k is curated from diverse sources, including textbooks, guidelines, open datasets, and ~1,000 de-identified hospital cases, using a clinician-in-the-loop pipeline with a two-stage LLM generation method (draft and revision) to ensure high-quality, traceable data for question-answering, multi-turn consultations, and report generation. SpineBench evaluates models on clinically salient axes, including level identification, pathology assessment, and surgical planning. Our comprehensive evaluation of several recently advanced large vision-language models (LVLMs) on SpineBench reveals systematic weaknesses in fine-grained, level-specific reasoning. In contrast, our model fine-tuned on SpineMed-450k demonstrates consistent and significant improvements across all tasks. Clinician assessments confirm the diagnostic clarity and practical utility of our model's outputs.
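As a rough illustration of what scoring along clinically salient axes could look like, the sketch below groups per-item results by axis and computes an accuracy for each. The record fields (`axis`, `predicted`, `reference`), the axis names, and the exact-match scoring rule are all assumptions made for illustration; the abstract does not specify SpineBench's item format or metrics.

```python
from collections import defaultdict


def accuracy_by_axis(items):
    """Return {axis: fraction of items whose prediction matches the reference}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        axis = item["axis"]
        total[axis] += 1
        # Hypothetical scoring rule: case-insensitive exact match.
        if item["predicted"].strip().lower() == item["reference"].strip().lower():
            correct[axis] += 1
    return {axis: correct[axis] / total[axis] for axis in total}


# Made-up example records, not drawn from SpineBench.
example_items = [
    {"axis": "level_identification", "predicted": "L4-L5", "reference": "L4-L5"},
    {"axis": "level_identification", "predicted": "L5-S1", "reference": "L4-L5"},
    {"axis": "pathology_assessment", "predicted": "disc herniation", "reference": "Disc herniation"},
]
scores = accuracy_by_axis(example_items)
```

Reporting one score per axis, rather than a single pooled number, is what would make a weakness in level-specific reasoning visible even when overall accuracy looks acceptable.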
Problem

Research questions and friction points this paper is trying to address.

Addressing limited AI diagnosis for spine disorders
Overcoming lack of level-aware multimodal medical datasets
Providing clinically-grounded benchmarks for vertebral-level reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale multimodal dataset for vertebral-level reasoning
Clinician-in-the-loop LLM generation for traceable data
Clinically-grounded benchmark evaluating level-specific pathology assessment
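
The draft-then-revision idea with traceability can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' pipeline: `InstructionRecord`, `two_stage_generate`, and the toy draft and revise functions are hypothetical names, the LLM calls are replaced by plain Python functions, and the clinician review step is not modeled.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class InstructionRecord:
    source_id: str    # identifier of the source material, kept for traceability
    source_text: str  # the passage or case text the instruction is derived from
    draft: str = ""
    revised: str = ""
    history: List[str] = field(default_factory=list)  # stages applied, in order


def two_stage_generate(
    source_id: str,
    source_text: str,
    draft_fn: Callable[[str], str],
    revise_fn: Callable[[str, str], str],
) -> InstructionRecord:
    """Run a draft stage, then a revision stage, recording each step."""
    record = InstructionRecord(source_id, source_text)
    record.draft = draft_fn(source_text)
    record.history.append("draft")
    # The revision stage sees both the source and the draft.
    record.revised = revise_fn(source_text, record.draft)
    record.history.append("revision")
    return record


# Toy stand-ins for the two LLM calls (assumptions, for illustration only).
def toy_draft(text: str) -> str:
    return f"Q: What finding is described? A: {text}"


def toy_revise(text: str, draft: str) -> str:
    return draft.replace("finding", "vertebral-level finding")


record = two_stage_generate("case-001", "Disc herniation at L4-L5.", toy_draft, toy_revise)
```

Keeping `source_id` and the stage history on every record is one simple way to let a reviewer trace any generated instruction back to its origin.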
Authors

Ming Zhao (Jilin University, π3Lab)
Wenhui Dong (Nanjing University, π3Lab)
Xiang Zheng (Department of Computer Science, City University of Hong Kong): Reinforcement Learning, Trustworthy AI, Embodied AI
Zhonghao Zhang (Ningxia University, π3Lab)
Zian Zhou (Zhejiang University, π3Lab)
Wei Peng (Stanford University)
Jianing Ni (π3Lab)
Changjiang Jiang (Wuhan University): MLLM, Rl Reasoning, Deep Research
Lixia Tian (Beijing Jiaotong University)
Pingping Liu (Jilin University)
Tongshun Zhang (College of Computer Science and Technology, Jilin University): Computer Vision, Image Enhancement, Image Restoration, Low Light Enhancement
Zhongan Bi (Zhejiang University, π3Lab)
Chenyang Si (Nanjing University)
Caifeng Shan (Philips Research): Computer Vision, Pattern Recognition, Machine Learning, Image/Video Analysis