OmniCT: Towards a Unified Slice-Volume LVLM for Comprehensive CT Analysis

📅 2026-02-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge that existing large vision-language models for medical imaging struggle to simultaneously capture fine-grained slice-level details and maintain voxel-level spatial consistency in CT analysis, owing to the absence of a unified modeling paradigm. To bridge this gap, the authors propose a unified slice-volume large vision-language model that integrates local detail with global spatial semantics through tri-axial positional encoding, a mixture-of-experts projection module, and organ segmentation–guided region-of-interest localization. They further introduce mechanisms for spatial consistency enhancement and organ-level semantic enrichment. Additionally, they construct MedEval-CT, the largest benchmark to date for slice-volume CT evaluation. Extensive experiments demonstrate that the approach significantly outperforms current models across multiple clinical tasks, achieving both high sensitivity to fine-grained details and robust macroscopic spatial reasoning.
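The tri-axial positional encoding mentioned above can be illustrated with a minimal sketch: one sinusoidal embedding per spatial axis (slice index z, row y, column x), summed into a single positional vector per voxel patch so that cross-slice position is encoded consistently. This is a generic reconstruction under stated assumptions, not the paper's exact formulation; the function names and the choice of summing (rather than concatenating) the per-axis terms are illustrative.

```python
import numpy as np

def sinusoidal_embedding(positions, dim):
    """Standard Transformer-style sinusoidal embedding for one axis."""
    freqs = np.exp(-np.log(10000.0) * np.arange(0, dim, 2) / dim)  # (dim/2,)
    angles = positions[:, None] * freqs[None, :]                   # (N, dim/2)
    emb = np.zeros((len(positions), dim))
    emb[:, 0::2] = np.sin(angles)
    emb[:, 1::2] = np.cos(angles)
    return emb

def triaxial_positional_embedding(depth, height, width, dim):
    """One positional vector per voxel patch: sum of per-axis embeddings
    along z (slice index), y (row), and x (column)."""
    assert dim % 2 == 0
    z = sinusoidal_embedding(np.arange(depth), dim)   # (D, dim)
    y = sinusoidal_embedding(np.arange(height), dim)  # (H, dim)
    x = sinusoidal_embedding(np.arange(width), dim)   # (W, dim)
    # Broadcast-sum into a (D, H, W, dim) grid of positional vectors.
    return z[:, None, None, :] + y[None, :, None, :] + x[None, None, :, :]

pe = triaxial_positional_embedding(depth=4, height=8, width=8, dim=32)
print(pe.shape)  # (4, 8, 8, 32)
```

Because the z-axis term varies across slices, two patches at the same in-plane location on different slices receive distinct embeddings, which is what gives a slice-based encoder a notion of volumetric order.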

📝 Abstract
Computed Tomography (CT) is one of the most widely used and diagnostically information-dense imaging modalities, covering critical organs such as the heart, lungs, liver, and colon. Clinical interpretation relies on both slice-driven local features (e.g., sub-centimeter nodules, lesion boundaries) and volume-driven spatial representations (e.g., tumor infiltration, inter-organ anatomical relations). However, existing Large Vision-Language Models (LVLMs) remain fragmented across CT slice and volumetric understanding: slice-driven LVLMs generalize well but lack cross-slice spatial consistency, while volume-driven LVLMs explicitly capture volumetric semantics but suffer from coarse granularity and poor compatibility with slice inputs. The absence of a unified modeling paradigm constitutes a major bottleneck for the clinical translation of medical LVLMs. We present OmniCT, a unified slice-volume LVLM for CT scenarios, which makes three contributions: (i) Spatial Consistency Enhancement (SCE): volumetric slice composition combined with a tri-axial positional embedding introduces volumetric consistency, and an MoE hybrid projection enables efficient slice-volume adaptation; (ii) Organ-level Semantic Enhancement (OSE): segmentation and ROI localization explicitly align anatomical regions, emphasizing lesion- and organ-level semantics; (iii) MedEval-CT: the largest slice-volume CT dataset and a hybrid benchmark integrating comprehensive metrics for unified evaluation. OmniCT consistently outperforms existing methods by a substantial margin across diverse clinical tasks and satisfies both micro-level detail sensitivity and macro-level spatial reasoning. More importantly, it establishes a new paradigm for cross-modal medical imaging understanding.
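The MoE hybrid projection in contribution (i) can be sketched as a small routing layer: a gating network assigns each visual token a soft mixture over expert projections (e.g., one expert specialized for 2D slice features, one for volumetric features), so slice and volume inputs share one projector. This is a minimal illustrative sketch under assumed names and shapes, not the paper's implementation; real MoE layers typically use MLP experts, top-k routing, and load-balancing losses omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class MoEProjector:
    """Route each visual token through a soft mixture of expert projections,
    e.g. a slice expert and a volume expert (hypothetical two-expert setup)."""
    def __init__(self, d_in, d_out, n_experts=2):
        self.gate = rng.standard_normal((d_in, n_experts)) * 0.02
        self.experts = [rng.standard_normal((d_in, d_out)) * 0.02
                        for _ in range(n_experts)]

    def __call__(self, tokens):                        # tokens: (N, d_in)
        weights = softmax(tokens @ self.gate)          # (N, n_experts), sums to 1
        outs = np.stack([tokens @ W for W in self.experts], axis=1)  # (N, E, d_out)
        return (weights[:, :, None] * outs).sum(axis=1)              # (N, d_out)

proj = MoEProjector(d_in=64, d_out=128)
out = proj(rng.standard_normal((10, 64)))
print(out.shape)  # (10, 128)
```

The soft gating means tokens from a standalone slice and tokens from a full volume flow through the same module, with the gate deciding per token how much each expert contributes.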
Problem

Research questions and friction points this paper is trying to address.

Computed Tomography
Large Vision-Language Models
slice-volume understanding
spatial consistency
medical imaging
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified Slice-Volume Modeling
Spatial Consistency Enhancement
Organ-level Semantic Enhancement
Medical Vision-Language Model
Tri-axial Positional Embedding