🤖 AI Summary
This study addresses critical bottlenecks in medical imaging (scarcity of annotated 3D CT volumes, privacy constraints, and stringent regulatory requirements) by proposing a clinical report-conditioned 3D CT volume generation paradigm. Methodologically, it introduces a text-conditioned autoregressive generative framework: an asymmetric VAE learns a compact latent space; CT-CLIP encodes semantic report features; and a 0.5B-parameter Transformer, trained via flow matching, enables precise cross-modal alignment. Crucially, it proposes a video-inspired slice-sequence-level autoregressive strategy that models latent representations segment by segment, preserving 3D anatomical continuity, diagnostic fidelity, and memory efficiency. Evaluated on the CT-RATE benchmark, the method achieves state-of-the-art performance across all metrics (FID, FVD, Inception Score, and CLIP Score), demonstrating substantial improvements in temporal coherence, volumetric diversity, and text-image alignment.
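The flow-matching training mentioned above can be sketched in a few lines. This is a minimal toy illustration, not the paper's implementation: the tiny linear map stands in for the 0.5B Transformer, and the flat vectors stand in for VAE-encoded slice-sequence latents and CT-CLIP report embeddings (all names and sizes here are assumptions for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)

DIM = 8  # toy latent dimension (illustrative, not the paper's)
W = rng.normal(scale=0.1, size=(2 * DIM + 1, DIM))  # toy "velocity model" weights

def velocity(x_t, t, cond):
    # Stand-in for v_theta(x_t, t, cond): the conditional velocity field
    # that the Transformer would predict from the noisy latent, the time
    # step, and the report embedding.
    feats = np.concatenate([x_t, cond, [t]])
    return feats @ W

def flow_matching_loss(x1, cond):
    # Rectified-flow-style objective: sample noise x0 and a time t,
    # interpolate x_t = (1 - t) * x0 + t * x1 along the straight path,
    # and regress the predicted velocity toward the target x1 - x0.
    x0 = rng.normal(size=x1.shape)
    t = rng.uniform()
    x_t = (1 - t) * x0 + t * x1
    target = x1 - x0
    pred = velocity(x_t, t, cond)
    return float(np.mean((pred - target) ** 2))

# One training-step loss on a random "latent" and "report embedding".
loss = flow_matching_loss(rng.normal(size=DIM), rng.normal(size=DIM))
print(loss)
```

In the actual model, `velocity` would be the Transformer and the loss would be minimised by gradient descent over many report/volume pairs; the sketch only shows the shape of the objective.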
📝 Abstract
Generative modelling of entire CT volumes conditioned on clinical reports has the potential to accelerate research through data augmentation, privacy-preserving synthesis, and reduced regulatory constraints on patient data, while preserving diagnostic signals. With the recent release of CT-RATE, a large-scale collection of 3D CT volumes paired with their respective clinical reports, training large text-conditioned CT volume generation models has become achievable. In this work, we introduce CTFlow, a 0.5B-parameter latent flow matching transformer model conditioned on clinical reports. We leverage the A-VAE from FLUX to define our latent space, and rely on the CT-CLIP text encoder to encode the clinical reports. To generate consistent whole CT volumes while keeping memory requirements tractable, we rely on a custom autoregressive approach: the model predicts the first sequence of slices of the volume from the text only, and then conditions on the previously generated sequence of slices and the text to predict the following sequence. We evaluate our results against a state-of-the-art generative CT model and demonstrate the superiority of our approach in terms of temporal coherence, image diversity, and text-image alignment, as measured by FID, FVD, IS, and CLIP score.
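The autoregressive rollout described in the abstract can be sketched as follows. This is a hedged illustration, not the released code: `generate_segment` stands in for the full flow-matching sampler, and the conditioning rule (text embedding plus a summary of the previous segment) is an assumption made only to show the control flow of generating a whole volume segment by segment.

```python
import numpy as np

rng = np.random.default_rng(0)

SEG_LEN, DIM = 4, 8  # slices per segment and latent dim (illustrative values)

def generate_segment(text_emb, prev_segment=None):
    # Stand-in for one flow-matching sampling pass that produces a
    # segment of slice latents. The first segment is conditioned on the
    # text alone; later segments also see the previous segment.
    if prev_segment is None:
        cond = text_emb
    else:
        cond = text_emb + prev_segment.mean(axis=0)
    return cond + rng.normal(scale=0.1, size=(SEG_LEN, DIM))

def generate_volume(text_emb, n_segments):
    # Autoregressive rollout: each new segment conditions on the text
    # and the segment generated just before it.
    segments = []
    prev = None
    for _ in range(n_segments):
        prev = generate_segment(text_emb, prev)
        segments.append(prev)
    # Concatenate segments along the slice axis to form the volume latents,
    # which a decoder (the A-VAE in the paper) would map back to CT slices.
    return np.concatenate(segments, axis=0)

vol = generate_volume(rng.normal(size=DIM), n_segments=3)
print(vol.shape)  # (12, 8)
```

Chaining segments this way keeps peak memory proportional to one segment rather than the whole volume, which is the stated motivation for the autoregressive design.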