🤖 AI Summary
Existing methods for 4D respiratory motion modeling rely on high-dose dual-frame acquisitions (start and end frames), yet preoperative low-dose single-frame scans suffer from dynamic background interference within the respiratory cycle, interference that conventional image registration fails to eliminate, leading to inaccurate temporal modeling. To address this, the authors propose the first single-image-to-video (I2V) synthesis framework tailored to 4D respiratory motion modeling. Its core innovation is a temporal differential field modeling mechanism: a temporal differential diffusion model explicitly captures inter-frame relative motion, while integrated prompt-attention and field-enhancement layers couple motion priors tightly with the generative process. Trained end-to-end on the ACDC and 4D Lung datasets, the framework generates 4D videos that evolve along ground-truth motion trajectories, achieving state-of-the-art perceptual quality and temporal consistency.
📝 Abstract
Temporal modeling of regular respiration-induced motion is crucial to image-guided clinical applications. Existing methods cannot simulate temporal motion unless high-dose imaging scans containing both the starting and ending frames are available. However, during preoperative data acquisition, slight patient movement can produce dynamic backgrounds between the first and last frames of a respiratory period. This additional deviation can hardly be removed by image registration and thus degrades temporal modeling. To address this limitation, we are the first to simulate the regular motion process with an image-to-video (I2V) synthesis framework, which animates the first frame to forecast future frames over a given length. Moreover, to promote the temporal consistency of the animated videos, we devise the Temporal Differential Diffusion Model to generate temporal differential fields, which measure the relative differential representations between adjacent frames. A prompt attention layer is devised to refine these differential fields, and a field augmented layer lets the fields interact with the I2V framework, yielding more accurate temporal variation in the synthesized videos. Extensive results on the ACDC cardiac and 4D Lung datasets show that our approach simulates 4D videos along the intrinsic motion trajectory, rivaling competitive methods in perceptual similarity and temporal consistency. Code will be released soon.
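To make the notion of "temporal differential fields" concrete, here is a minimal numeric sketch. In the paper these fields are *generated* by a diffusion model; the sketch below only illustrates the underlying quantity, assuming (hypothetically) that the differential field is a pixel-wise difference between adjacent frames, and that a video can be rolled forward from its first frame by accumulating those fields. Function names are illustrative, not from the paper's code.

```python
import numpy as np

def temporal_differential_fields(frames):
    """Relative differential representations between adjacent frames.

    frames: array of shape (T, H, W), a video with T frames.
    Returns an array of shape (T-1, H, W) where field[t] = frames[t+1] - frames[t].
    """
    frames = np.asarray(frames, dtype=np.float32)
    return frames[1:] - frames[:-1]

def animate_from_first_frame(first_frame, fields):
    """Forecast future frames by accumulating differential fields onto frame 0.

    This mirrors the I2V idea of animating a single starting frame:
    frame[t] = frame[0] + sum(field[0..t-1]).
    """
    return np.asarray(first_frame, dtype=np.float32) + np.cumsum(fields, axis=0)
```

Because each field encodes only *relative* motion between neighbors, supervising the fields (rather than raw frames) pushes the generator toward temporally consistent trajectories, which is the motivation the abstract gives for the differential formulation.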