Ophora: A Large-Scale Data-Driven Text-Guided Ophthalmic Surgical Video Generation Model

📅 2025-05-12
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the scarcity of annotated training data, high annotation costs, and privacy sensitivity in ophthalmic surgery AI development, this paper introduces Ophora-160K—the first large-scale ophthalmic surgical video–instruction paired dataset, comprising 160,000 high-quality samples—and proposes a progressive video–instruction fine-tuning paradigm tailored to domain-specific surgical tasks. The authors design a text-driven video generation framework that enables cross-domain spatiotemporal knowledge transfer and multimodal diffusion modeling under strict privacy-preserving constraints. Expert clinical evaluation confirms that the generated videos achieve superior realism and procedural fidelity compared to baselines. Furthermore, the approach improves accuracy by 12.7% on downstream surgical workflow understanding tasks. Both the codebase and the Ophora-160K dataset are publicly released to foster reproducible research and community advancement.

📝 Abstract
In ophthalmic surgery, developing an AI system capable of interpreting surgical videos and predicting subsequent operations requires numerous ophthalmic surgical videos with high-quality annotations, which are difficult to collect due to privacy concerns and labor consumption. Text-guided video generation (T2V) emerges as a promising solution to overcome this issue by generating ophthalmic surgical videos based on surgeon instructions. In this paper, we present Ophora, a pioneering model that can generate ophthalmic surgical videos following natural language instructions. To construct Ophora, we first propose a Comprehensive Data Curation pipeline to convert narrative ophthalmic surgical videos into a large-scale, high-quality dataset comprising over 160K video-instruction pairs, Ophora-160K. Then, we propose a Progressive Video-Instruction Tuning scheme to transfer rich spatial-temporal knowledge from a T2V model pre-trained on natural video-text datasets for privacy-preserved ophthalmic surgical video generation based on Ophora-160K. Experiments on video quality evaluation via quantitative analysis and ophthalmologist feedback demonstrate that Ophora can generate realistic and reliable ophthalmic surgical videos based on surgeon instructions. We also validate the capability of Ophora for empowering downstream tasks of ophthalmic surgical workflow understanding. Code is available at https://github.com/mar-cry/Ophora.
Problem

Research questions and friction points this paper is trying to address.

Generating annotated ophthalmic surgical videos from text instructions
Overcoming data scarcity and privacy issues in surgical video collection
Enhancing AI understanding of ophthalmic surgical workflows via synthetic videos
Innovation

Methods, ideas, or system contributions that make the work stand out.

Comprehensive Data Curation pipeline for dataset creation
Progressive Video-Instruction Tuning for knowledge transfer
Generates realistic ophthalmic surgical videos from text
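The Comprehensive Data Curation pipeline is described here only at a high level; as an illustrative sketch of the general idea (the function, field names, and phase-annotation format below are assumptions for illustration, not Ophora's actual schema), converting one narrated surgical video's phase annotations into video–instruction training pairs might look like:

```python
def build_pairs(video_id, phases):
    """Convert one annotated video into (clip, instruction) training pairs.

    `phases` is a list of (start_sec, end_sec, description) tuples — a
    hypothetical annotation format used only to sketch the pairing step.
    """
    pairs = []
    for start, end, description in phases:
        pairs.append({
            # Reference to the clip span within the source video.
            "clip": f"{video_id}[{start:.1f}-{end:.1f}s]",
            # Surgeon-style natural-language instruction for that span.
            "instruction": f"Perform the following step: {description}",
        })
    return pairs

# Toy example with two annotated phases of a cataract procedure.
demo = build_pairs("case_001", [(0.0, 12.5, "corneal incision"),
                                (12.5, 40.0, "capsulorhexis")])
```

At dataset scale, pairs like these would then feed the progressive tuning stages described in the abstract.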
Wei Li
Shanghai Jiao Tong University, China; Shanghai Artificial Intelligence Laboratory, China
Ming Hu
Monash University, Australia; Shanghai Artificial Intelligence Laboratory, China
Guoan Wang
Stevens Institute of Technology
General Medical AI
Lihao Liu
Amazon
LLM-based Agent · Healthcare AI
Kaijin Zhou
Eye Hospital, Wenzhou Medical University, China
Junzhi Ning
Imperial College London, UK; Shanghai Artificial Intelligence Laboratory, China
Xin Guo
Shanghai Academy of Artificial Intelligence for Science, China
Zongyuan Ge
Monash University, Australia
Lixu Gu
Professor, Shanghai Jiao Tong University
Medical image analysis · Image-guided intervention
Junjun He
Shanghai Jiao Tong University