SliceWorld: A Predictive and Controllable World-State Model for CT Report Generation

📅 2026-05-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses two challenges in CT report generation: modeling three-dimensional anatomical structure and the evolution of lesions across sequential slices, and the lack of controllable intervention on lesion-related factors. To this end, it proposes the first world-state framework tailored to CT imaging. The method treats a CT scan as an ordered sequence of slices along the z-axis and constructs a factor-aware latent state with anatomy, lesion, and uncertainty components, which is then projected into “world tokens.” This representation supports multi-step prediction of future-slice features, lesion-factor intervention, and controllable report generation driven by a large language model (LLM). The model combines factor-aware encoding and world-token projection with multi-objective pretraining (future-slice prediction, factor alignment, and counterfactual reasoning), followed by fine-tuning on paired CT-report data. On the M3D-Cap and CT-RATE datasets it improves natural language generation metrics and clinically oriented automatic evaluation scores. Further analyses show multi-horizon slice prediction, measurable factor control, robustness to reduced slice input, and selective modulation of lesion sensitivity in the generated reports.
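The data flow described above (ordered slices, a factor-aware latent state, world tokens, multi-step prediction) can be illustrated with a toy sketch. It is not the authors' implementation: every name here (`FactorState`, `encode_prefix`, `to_world_tokens`, `rollout`) is hypothetical, the "encoder" is an exponential running average chosen only because it depends on slice order, and the projection and transition are fixed linear maps standing in for learned networks.

```python
from dataclasses import dataclass
from typing import List

Vector = List[float]
Matrix = List[List[float]]


@dataclass
class FactorState:
    """Hypothetical factor-aware latent state with anatomy, lesion, and uncertainty parts."""
    anatomy: Vector
    lesion: Vector
    uncertainty: Vector

    def flat(self) -> Vector:
        return self.anatomy + self.lesion + self.uncertainty


def matvec(m: Matrix, v: Vector) -> Vector:
    """Plain matrix-vector product."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def encode_prefix(prefix: List[Vector], dim: int, decay: float = 0.5) -> FactorState:
    """Toy order-aware encoder: an exponential running average over slice features taken
    in z order (later slices weigh more), split into three chunks of size `dim`."""
    h = [0.0] * (3 * dim)
    for slice_feat in prefix:  # assumes slices are already sorted along the z-axis
        h = [decay * a + (1 - decay) * b for a, b in zip(h, slice_feat)]
    return FactorState(h[:dim], h[dim:2 * dim], h[2 * dim:])


def to_world_tokens(state: FactorState, proj: Matrix) -> List[Vector]:
    """Project each factor with a shared (hypothetical) matrix: one world token per factor."""
    return [matvec(proj, part) for part in (state.anatomy, state.lesion, state.uncertainty)]


def rollout(state: FactorState, transition: Matrix, horizon: int) -> List[Vector]:
    """Autoregressive multi-step prediction: apply a toy linear transition and feed each
    predicted vector back in as the next input. Here latent and feature spaces coincide."""
    preds, current = [], state.flat()
    for _ in range(horizon):
        current = matvec(transition, current)
        preds.append(current)
    return preds


# Usage with made-up numbers: two slices, six features each, dim = 2 per factor.
dim = 2
slices = [[1.0, 0.0, 0.0, 0.0, 0.5, 0.5],
          [1.0, 0.0, 1.0, 1.0, 0.5, 0.5]]
state = encode_prefix(slices, dim)  # anatomy [0.75, 0.0], lesion [0.5, 0.5]
tokens = to_world_tokens(state, [[1.0, 0.0], [0.0, 1.0]])  # identity projection
half = [[0.5 if i == j else 0.0 for j in range(3 * dim)] for i in range(3 * dim)]
preds = rollout(state, half, horizon=3)  # three future steps, each 6-dimensional
```

Keeping the three factors as separate fields is what makes a targeted edit to one factor possible later; a single undifferentiated vector would offer no such handle.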
📝 Abstract
CT report generation (CTRG) requires models to summarize three-dimensional anatomical context and pathological findings from hundreds of axial slices. Existing methods typically learn a direct image-to-text mapping, providing limited mechanisms for modeling how CT evidence evolves across slices or how reports respond to controlled changes in latent lesion-related factors. We propose SliceWorld, a CT-specific world-state framework that treats an axial CT scan as an ordered sequence along the z-axis. SliceWorld encodes prefix CT evidence into factor-aware latent states containing anatomy, lesion, and uncertainty components, and projects these states into world tokens used for multi-step future-slice feature prediction, lesion-factor intervention, and LLM-based report generation. The model is first pretrained on CT slice sequences with predictive, factor-aware, and counterfactual objectives, and is then fine-tuned on paired CT-report data. Experiments on M3D-Cap and CT-RATE show that SliceWorld improves natural language generation metrics and clinically oriented automatic evaluation. Further analyses demonstrate multi-horizon future-slice prediction, measurable factor alignment, reduced-slice robustness, and selective lesion-sensitive report modulation.
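The abstract names a lesion-factor intervention and three pretraining objectives (predictive, factor-aware, counterfactual) without giving their form here. The sketch below shows one plausible reading under stated assumptions, not the paper's losses: the intervention rescales only the lesion component; the factor term is an MSE against a hypothetical lesion label; the counterfactual term penalizes drift in non-lesion factors after a toy dynamics step (`toy_step`, with a made-up `leak` coefficient modeling entanglement); and the total is a weighted sum with placeholder weights. All function names are hypothetical.

```python
from typing import Dict, List

Vector = List[float]
State = Dict[str, Vector]


def mse(a: Vector, b: Vector) -> float:
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def intervene_lesion(state: State, scale: float) -> State:
    """Hypothetical lesion-factor intervention: rescale only the lesion component
    and copy the other factors unchanged, giving a counterfactual state."""
    return {k: ([scale * x for x in v] if k == "lesion" else list(v)) for k, v in state.items()}


def toy_step(state: State, leak: float) -> State:
    """Toy dynamics in which anatomy absorbs `leak` times the lesion vector,
    i.e. the factors are entangled when leak > 0."""
    return {
        "anatomy": [a + leak * les for a, les in zip(state["anatomy"], state["lesion"])],
        "lesion": list(state["lesion"]),
        "uncertainty": list(state["uncertainty"]),
    }


def predictive_loss(preds: List[Vector], targets: List[Vector]) -> float:
    """Average per-step MSE over the prediction horizon."""
    return sum(mse(p, t) for p, t in zip(preds, targets)) / len(preds)


def counterfactual_loss(state: State, edited: State, leak: float) -> float:
    """Assumed form: after the dynamics step, an edit to the lesion factor should not
    change the non-lesion factors."""
    a, b = toy_step(state, leak), toy_step(edited, leak)
    return sum(mse(a[k], b[k]) for k in a if k != "lesion")


def total_loss(l_pred: float, l_factor: float, l_cf: float,
               weights=(1.0, 0.5, 0.5)) -> float:
    """Weighted sum of the three pretraining terms (weights are placeholders)."""
    return weights[0] * l_pred + weights[1] * l_factor + weights[2] * l_cf


# Usage with made-up numbers.
state = {"anatomy": [0.75, 0.0], "lesion": [0.5, 0.5], "uncertainty": [0.375, 0.375]}
edited = intervene_lesion(state, 0.0)        # remove lesion evidence
cf_clean = counterfactual_loss(state, edited, leak=0.0)
cf_leaky = counterfactual_loss(state, edited, leak=0.5)
l_pred = predictive_loss([[1.0, 0.0]], [[0.0, 0.0]])
l_factor = mse(state["lesion"], [1.0, 0.0])  # align lesion factor with a hypothetical label
total = total_loss(l_pred, l_factor, cf_leaky)
```

With `leak=0.0` the edit leaves the anatomy factor untouched and the counterfactual term is zero; with `leak=0.5` the same edit shifts anatomy, and the term becomes positive. That is the kind of disentanglement pressure a counterfactual objective could provide, though the paper's actual formulation may differ.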
Problem

CT report generation
world-state model
lesion factors
slice sequence modeling
controllable generation
Innovation

world-state model
factor-aware latent representation
controllable CT report generation
multi-step slice prediction
counterfactual intervention