SpineBench: Benchmarking Multimodal LLMs for Spinal Pathology Analysis

📅 2025-10-14
🤖 AI Summary
Existing medical multimodal large language models (MLLMs) lack specialized benchmarks for highly visual pathology domains such as spinal disorders. Method: We introduce SpineBench—the first dedicated benchmark for spinal pathology analysis—comprising 40,263 spinal images and 64,878 question-answer pairs across 11 disease categories, supporting both diagnostic classification and lesion localization. It innovatively incorporates visually similar hard negative answer options to better emulate clinical differential diagnosis challenges, and unifies and standardizes multiple open-source spinal datasets into high-quality multiple-choice visual question answering (VQA) format. Contribution/Results: Comprehensive evaluation across 12 state-of-the-art MLLMs reveals severe performance limitations on spinal tasks (average accuracy: 52.3%), exposing bottlenecks in fine-grained visual understanding and medical knowledge integration. SpineBench establishes a reproducible, scalable evaluation paradigm and provides concrete directions for targeted optimization of medical AI systems.

📝 Abstract
With the increasing integration of Multimodal Large Language Models (MLLMs) into the medical field, comprehensive evaluation of their performance across various medical domains becomes critical. However, existing benchmarks primarily assess general medical tasks, inadequately capturing performance in nuanced areas like the spine, which rely heavily on visual input. To address this, we introduce SpineBench, a comprehensive Visual Question Answering (VQA) benchmark designed for fine-grained analysis and evaluation of MLLMs in the spinal domain. SpineBench comprises 64,878 QA pairs from 40,263 spine images, covering 11 spinal diseases through two critical clinical tasks: spinal disease diagnosis and spinal lesion localization, both in multiple-choice format. SpineBench is built by integrating and standardizing image-label pairs from open-source spinal disease datasets, and it samples challenging hard negative options for each VQA pair based on visual similarity (similar but not the same disease), simulating challenging real-world scenarios. We evaluate 12 leading MLLMs on SpineBench. The results reveal that these models perform poorly on spinal tasks, highlighting the limitations of current MLLMs in the spine domain and guiding future improvements in spinal medicine applications. SpineBench is publicly available at https://zhangchenghanyu.github.io/SpineBench.github.io/.
Problem

Research questions and friction points this paper is trying to address.

Evaluating MLLM performance in spinal pathology analysis
Addressing lack of specialized benchmarks for spinal diseases
Assessing visual reasoning for diagnosis and lesion localization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces SpineBench VQA benchmark for spinal analysis
Uses 64,878 QA pairs from 40,263 spine images
Implements hard negative sampling based on visual similarity
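The hard-negative sampling idea above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the disease names, the prototype embeddings, and the toy 4-dimensional vectors are all hypothetical stand-ins for real image features from a vision encoder.

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def sample_hard_negatives(query_emb, disease_protos, true_disease, k=3):
    """Pick the k diseases whose prototype image embeddings are most
    visually similar to the query image, excluding the true disease.
    These serve as challenging distractor options for a VQA pair."""
    scored = [
        (disease, cosine_sim(query_emb, proto))
        for disease, proto in disease_protos.items()
        if disease != true_disease
    ]
    scored.sort(key=lambda x: x[1], reverse=True)
    return [disease for disease, _ in scored[:k]]

# Toy example: 4-dim "embeddings" standing in for real image features.
protos = {
    "disc herniation":   np.array([0.9, 0.1, 0.0, 0.0]),
    "spinal stenosis":   np.array([0.8, 0.2, 0.1, 0.0]),
    "spondylolisthesis": np.array([0.0, 0.1, 0.9, 0.3]),
    "scoliosis":         np.array([0.1, 0.9, 0.1, 0.2]),
}
query = np.array([0.85, 0.15, 0.05, 0.0])  # image labeled "disc herniation"
print(sample_hard_negatives(query, protos, "disc herniation", k=2))
# → ['spinal stenosis', 'scoliosis']
```

The visually closest wrong diseases become the distractors, which is what makes the multiple-choice options emulate clinical differential diagnosis rather than random-chance elimination.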
Chenghanyu Zhang
Beijing University of Posts and Telecommunications, Beijing, China
Zekun Li
University of California, Santa Barbara, Santa Barbara, California, United States
Peipei Li
Beijing University of Posts and Telecommunications (BUPT)
Computer Vision · Image Synthesis · Face Recognition
Xing Cui
Beijing University of Posts and Telecommunications, Beijing, China
Shuhan Xia
Beijing University of Posts and Telecommunications
Artificial Intelligence · Multimodal
Weixiang Yan
Amazon
Code Intelligence · Agentic RL · Software Automation
Yiqiao Zhang
Peking Union Medical College Hospital, Beijing, China
Qianyu Zhuang
Peking Union Medical College Hospital, Beijing, China