🤖 AI Summary
To address anatomical detail blurring, text-image semantic misalignment, and memory explosion in high-resolution, long-sequence CT modeling for 3D medical imaging, this paper proposes BTB3D: a unified 2D/3D encoder-decoder based on causal convolution, incorporating frequency-aware voxel tokenization and a three-stage curriculum learning strategy to enable memory-efficient long-context modeling. By combining local reconstruction, overlapping-window slicing, and long-context decoder fine-tuning, BTB3D supports end-to-end training on volumetric inputs up to 512×512×241. On report generation, it improves BLEU scores and raises clinical F1 by 40%; on text-to-CT synthesis, it reduces FID by 75% and FVD by 50%, generating high-resolution 3D volumes with anatomically consistent structures and setting a new state of the art.
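The overlapping-window slicing mentioned above can be illustrated with a small sketch: the scan's depth axis is tiled into fixed-size windows that overlap, so every slice is covered and adjacent windows share context. The window size (32 slices) and overlap (8 slices) below are illustrative assumptions, not the paper's hyperparameters.

```python
def depth_windows(num_slices, window=32, overlap=8):
    """Tile the depth axis into overlapping [start, end) windows.

    window and overlap are illustrative values, not BTB3D's settings.
    """
    stride = window - overlap
    starts = list(range(0, max(num_slices - window, 0) + 1, stride))
    # If the regular stride leaves a tail uncovered, add one final
    # window flush against the end of the scan.
    if starts[-1] + window < num_slices:
        starts.append(num_slices - window)
    return [(s, min(s + window, num_slices)) for s in starts]

# A 241-slice scan (the paper's 512x512x241 resolution) tiles into
# ten windows; the last one is shifted back to end exactly at slice 241.
print(depth_windows(241))
```

Because each window is processed independently, peak memory is bounded by the window size rather than the full scan length, which is the intuition behind training on short excerpts yet running inference on 300+ slices.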
📝 Abstract
Recent progress in vision-language modeling for 3D medical imaging has been fueled by large-scale computed tomography (CT) corpora with paired free-text reports, stronger architectures, and powerful pretrained models. This has enabled applications such as automated report generation and text-conditioned 3D image synthesis. Yet, current approaches struggle with high-resolution, long-sequence volumes: contrastive pretraining often yields vision encoders that are misaligned with clinical language, and slice-wise tokenization blurs fine anatomy, reducing diagnostic performance on downstream tasks. We introduce BTB3D (Better Tokens for Better 3D), a causal convolutional encoder-decoder that unifies 2D and 3D training and inference while producing compact, frequency-aware volumetric tokens. A three-stage training curriculum enables (i) local reconstruction, (ii) overlapping-window tiling, and (iii) long-context decoder refinement, during which the model learns from short slice excerpts yet generalizes to scans exceeding 300 slices without additional memory overhead. BTB3D sets a new state-of-the-art on two key tasks: it improves BLEU scores and increases clinical F1 by 40% over CT2Rep, CT-CHAT, and Merlin for report generation; and it reduces FID by 75% and halves FVD compared to GenerateCT and MedSyn for text-to-CT synthesis, producing anatomically consistent 512×512×241 volumes. These results confirm that precise three-dimensional tokenization, rather than larger language backbones alone, is essential for scalable vision-language modeling in 3D medical imaging. The codebase is available at: https://github.com/ibrahimethemhamamci/BTB3D
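The causality that lets a model trained on short slice excerpts generalize to longer scans can be shown in one dimension: with left-only padding along the depth axis, the output at slice t depends only on slices ≤ t, so the computation is identical whether the scan has 30 slices or 300. This is a minimal NumPy sketch of that causal-along-depth idea; the per-slice scalar features and the 2-tap kernel are illustrative assumptions, not BTB3D's architecture.

```python
import numpy as np

def causal_depth_conv(x, kernel):
    """Causal 1-D convolution along the depth (slice) axis.

    x: (D,) one scalar feature per slice (illustrative stand-in for
       a slice's feature map); kernel: (k,) with kernel[0] weighting
       the current slice, kernel[i] the slice i steps earlier.
    """
    k = len(kernel)
    # Left-pad with k-1 zeros so output t never sees slices > t.
    xp = np.concatenate([np.zeros(k - 1), np.asarray(x, dtype=float)])
    kern = np.asarray(kernel, dtype=float)
    return np.array([np.dot(xp[t:t + k], kern[::-1]) for t in range(len(x))])

# y[t] = 1.0 * x[t] + 0.5 * x[t-1]; editing a later slice leaves
# earlier outputs untouched, which is the causal property.
y = causal_depth_conv([1.0, 2.0, 3.0, 4.0], [1.0, 0.5])
print(y)
```

In the 3D setting the same padding trick applies per convolution layer along depth, which is also what makes 2D (single-slice) and 3D (volume) processing share one set of weights.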