🤖 AI Summary
This work addresses the challenge of high-fidelity, text-guided editing of ophthalmic surgical videos under stringent anatomical and temporal constraints by proposing OphEdit, a training-free video editing framework and the first of its kind for this domain. The method uses deterministic second-order ODE inversion to extract and store Attention Value (V) tensors from the original video, which are then selectively injected into the conditional Classifier-Free Guidance (CFG) branch during the diffusion model’s denoising process. This enables precise semantic modifications while preserving ocular anatomical structure and temporal coherence. The approach can generate diverse, annotated medical videos without model fine-tuning or additional data collection. Clinical evaluations indicate better structural fidelity and temporal consistency than existing general-purpose video editors on complex tasks such as instrument replacement and procedural modification.
📝 Abstract
High-fidelity surgical video generation can greatly improve medical training and the development of AI, but adapting these generative models for precise video editing remains a formidable challenge. Modifying surgical attributes, such as instrument-tissue interactions or procedural phases, is difficult because of strict anatomical and temporal constraints. In this paper, we propose OphEdit, a novel training-free framework for the text-guided editing of ophthalmic surgical videos. Our approach leverages a deterministic second-order ODE inversion pipeline to capture Attention Value (V) tensors from the original video. By selectively injecting these stored tensors into the conditional Classifier-Free Guidance (CFG) branch during the denoising phase, OphEdit rigorously preserves the intricate anatomical geometry of the eye while seamlessly mapping text-driven semantic modifications onto the video stream. Clinical evaluations demonstrate that OphEdit effectively handles complex surgical transformations, such as instrument swaps and procedural variations, with superior structural fidelity and temporal consistency compared to natural-domain video editors. Our work represents the first application of training-free video editing in the ophthalmic surgical domain, offering a scalable solution for generating diverse, annotated medical datasets without the need for exhaustive manual recording or costly model fine-tuning. The code and prompts can be accessed at https://github.com/ophedit/OphEdit
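To make the injection idea concrete, the following is a minimal NumPy sketch of swapping a stored V tensor into the conditional branch of a classifier-free guidance step. It is an illustration under stated assumptions, not the OphEdit implementation: the function names (`attention`, `cfg_step`), the single-head toy attention, and the tensor shapes are all hypothetical, and a real video diffusion model would apply this inside many attention layers and timesteps.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v, v_override=None):
    """Single-head scaled dot-product attention.

    If v_override is given, it replaces v. This stands in for injecting a
    V tensor that was stored while inverting the original video.
    """
    if v_override is not None:
        v = v_override
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def cfg_step(q_c, k_c, v_c, q_u, k_u, v_u, stored_v, scale, inject):
    """Toy classifier-free guidance combination (hypothetical helper).

    Only the conditional branch receives the stored V tensor; the
    unconditional branch is computed as usual.
    """
    cond = attention(q_c, k_c, v_c, v_override=stored_v if inject else None)
    uncond = attention(q_u, k_u, v_u)
    # Standard CFG extrapolation from the unconditional prediction.
    return uncond + scale * (cond - uncond)

# Toy usage: 4 tokens with 8 channels each (assumed shapes).
rng = np.random.default_rng(0)
q_c, k_c, v_c, q_u, k_u, v_u, stored_v = (
    rng.normal(size=(4, 8)) for _ in range(7)
)
edited = cfg_step(q_c, k_c, v_c, q_u, k_u, v_u, stored_v, scale=7.5, inject=True)
```

In this sketch the edit prompt still shapes the attention weights through the conditional queries and keys, while the values being mixed come from the source video. That is one plausible reading of how structure could be preserved while semantics change.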