Are Video Models Emerging as Zero-Shot Learners and Reasoners in Medical Imaging?

📅 2025-10-11
📈 Citations: 0
Influential Citations: 0
🤖 AI Summary
This study pioneers the investigation of zero-shot transferability of general-purpose large video models (LVMs) to medical imaging: specifically, whether LVMs pretrained exclusively on natural videos can directly perform diverse medical tasks, including organ segmentation, denoising, super-resolution, and respiratory/cardiac motion prediction, without any exposure to medical data. Method: We propose an autoregressive spatiotemporal modeling framework operating on 4D CT sequences, leveraging the LVM's intrinsic generative understanding to reconstruct 3D anatomical structures and infer dynamic physiological processes. Contribution/Results: Evaluated on 122 patient cases, our zero-shot paradigm achieves strong performance across all tasks without any fine-tuning. Notably, for motion prediction it delivers both high spatial fidelity and temporal consistency, outperforming task-specific supervised models and achieving state-of-the-art spatial accuracy. This work empirically validates the feasibility of general video models as unified medical foundation models and establishes a novel zero-shot paradigm for medical AI.
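To make the described framework concrete, below is a minimal sketch of how the respiratory phases of a 4D CT scan can be arranged as a video sequence and rolled forward autoregressively with a pretrained video model, with each predicted 3D phase fed back as context for the next. The `PretrainedVideoModel` class, its `predict_next_frame` interface, and the `forecast_phases` helper are hypothetical stand-ins; the paper's actual model architecture and tokenization are not specified in this summary.

```python
# Minimal sketch: autoregressive next-phase prediction on a 4D CT sequence.
# Assumptions (not from the paper): a generic autoregressive video model with a
# next-frame interface, and 4D CT stored as (phases, depth, height, width).
import torch
import torch.nn as nn


class PretrainedVideoModel(nn.Module):
    """Placeholder for a large video model pretrained on natural videos."""

    def predict_next_frame(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (T, D, H, W) context phases. A real LVM would generate the
        # next phase autoregressively; here we copy the last phase so the
        # sketch runs end to end.
        return frames[-1].clone()


def forecast_phases(model: PretrainedVideoModel,
                    context: torch.Tensor,
                    num_future: int) -> torch.Tensor:
    """Roll the model forward: predict each future 3D phase from prior phases."""
    history = list(context)
    predictions = []
    for _ in range(num_future):
        next_phase = model.predict_next_frame(torch.stack(history))
        predictions.append(next_phase)
        history.append(next_phase)  # feed predictions back in (autoregressive)
    return torch.stack(predictions)


# Example: a 10-phase respiratory 4D CT volume at toy resolution; use the first
# 6 phases as context and forecast the remaining 4.
ct_4d = torch.randn(10, 64, 128, 128)
model = PretrainedVideoModel()
future = forecast_phases(model, ct_4d[:6], num_future=4)
print(future.shape)  # torch.Size([4, 64, 128, 128])
```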

📝 Abstract
Recent advances in large generative models have shown that simple autoregressive formulations, when scaled appropriately, can exhibit strong zero-shot generalization across domains. Motivated by this trend, we investigate whether autoregressive video modeling principles can be directly applied to medical imaging tasks, despite the model never being trained on medical data. Specifically, we evaluate a large vision model (LVM) in a zero-shot setting across four representative tasks: organ segmentation, denoising, super-resolution, and motion prediction. Remarkably, even without domain-specific fine-tuning, the LVM can delineate anatomical structures in CT scans and achieve competitive performance on segmentation, denoising, and super-resolution. Most notably, in radiotherapy motion prediction, the model forecasts future 3D CT phases directly from prior phases of a 4D CT scan, producing anatomically consistent predictions that capture patient-specific respiratory dynamics with realistic temporal coherence. We evaluate the LVM on 4D CT data from 122 patients, totaling over 1,820 3D CT volumes. Despite no prior exposure to medical data, the model achieves strong performance across all tasks and surpasses specialized DVF-based and generative baselines in motion prediction, achieving state-of-the-art spatial accuracy. These findings reveal the emergence of zero-shot capabilities in medical video modeling and highlight the potential of general-purpose video models to serve as unified learners and reasoners, laying the groundwork for future medical foundation models built on video models.
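The abstract reports spatial accuracy across the four tasks. As a generic illustration only (not the paper's exact evaluation protocol), voxel-wise metrics such as PSNR for reconstruction and prediction quality, and the Dice coefficient for segmentation overlap, are standard ways to quantify this on CT volumes:

```python
# Generic spatial-accuracy metrics commonly used for CT prediction and
# segmentation evaluation (Dice, PSNR). Illustration only; the paper's
# specific evaluation pipeline is not given in this abstract.
import numpy as np


def dice_coefficient(pred_mask: np.ndarray, gt_mask: np.ndarray) -> float:
    """Dice overlap between two binary organ masks."""
    pred = pred_mask.astype(bool)
    gt = gt_mask.astype(bool)
    intersection = np.logical_and(pred, gt).sum()
    denom = pred.sum() + gt.sum()
    return float(2.0 * intersection / denom) if denom > 0 else 1.0


def psnr(pred_volume: np.ndarray, gt_volume: np.ndarray, data_range: float) -> float:
    """Peak signal-to-noise ratio between predicted and reference CT volumes."""
    mse = np.mean((pred_volume.astype(np.float64) - gt_volume.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10((data_range ** 2) / mse))


# Example usage on toy volumes.
gt = np.random.rand(64, 128, 128)
pred = gt + 0.01 * np.random.randn(64, 128, 128)
print(psnr(pred, gt, data_range=1.0))
print(dice_coefficient(pred > 0.5, gt > 0.5))
```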
Problem

Research questions and friction points this paper is trying to address.

Investigating autoregressive video models for zero-shot medical imaging tasks
Evaluating large vision models on organ segmentation and image enhancement
Forecasting radiotherapy motion from 4D CT scans without medical training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Autoregressive video modeling applied to medical imaging
Zero-shot generalization across four medical tasks
Forecasting future CT phases with anatomical consistency
Yuxiang Lai
Ph.D. Student in Computer Science, Emory University
Computer Vision, Medical Imaging
Jike Zhong
University of Southern California
Computer Vision, Machine Learning
Ming Li
Department of Computer Science, University of Maryland
Yuheng Li
Department of Biomedical Engineering, Georgia Institute of Technology
Xiaofeng Yang
Department of Computer Science, Emory University; Department of Radiation Oncology and Winship Cancer Institute, Emory University