SG2VID: Scene Graphs Enable Fine-Grained Control for Video Synthesis

📅 2025-06-03
🤖 AI Summary
Traditional surgical simulators suffer from limited anatomical fidelity and controllability, and existing video generation methods prioritize conditional accuracy while lacking fine-grained, interactive editing. To address this, the authors propose the first scene-graph-driven, diffusion-based surgical video generation model, introducing structured semantic representations to surgical video synthesis. The method enables precise, editable control over instrument size, anatomical motion, the entrance of new instruments, and overall scene layout, and it supports interactive generation of rare intra-operative irregularities. It integrates scene graph encoding and alignment, multimodal conditional injection, and spatiotemporal modeling tailored to surgical video. Evaluated on cataract and cholecystectomy datasets, the approach outperforms state-of-the-art methods, and its generated videos improve downstream surgical phase detection while achieving strong fidelity, controllability, and visual diversity.

📝 Abstract
Surgical simulation plays a pivotal role in training novice surgeons, accelerating their learning curve and reducing intra-operative errors. However, conventional simulation tools fall short in providing the necessary photorealism and the variability of human anatomy. In response, current methods are shifting towards generative model-based simulators. Yet, these approaches primarily focus on using increasingly complex conditioning for precise synthesis while neglecting the fine-grained human control aspect. To address this gap, we introduce SG2VID, the first diffusion-based video model that leverages Scene Graphs for both precise video synthesis and fine-grained human control. We demonstrate SG2VID's capabilities across three public datasets featuring cataract and cholecystectomy surgery. While SG2VID outperforms previous methods both qualitatively and quantitatively, it also enables precise synthesis, providing accurate control over the size and movement of tools and anatomy, the entrance of new tools, as well as the overall scene layout. We qualitatively motivate how SG2VID can be used for generative augmentation and present an experiment demonstrating its ability to improve a downstream phase detection task when the training set is extended with our synthetic videos. Finally, to showcase SG2VID's ability to retain human control, we interact with the Scene Graphs to generate new video samples depicting major yet rare intra-operative irregularities.
Problem

Research questions and friction points this paper is trying to address.

Lack of photorealism and anatomical variability in surgical simulations
Limited fine-grained human control in generative model-based simulators
Need for precise synthesis and control in surgical video generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages Scene Graphs for video synthesis
Enables fine-grained human control
Uses diffusion-based model for realism
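To make the scene-graph idea concrete, here is a minimal sketch of a per-frame surgical scene graph that could serve as a conditioning signal for a video diffusion model. The node and edge attributes (labels, normalized bounding boxes, relation names) are illustrative assumptions, not the paper's actual schema or API:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str   # e.g. "phaco_handpiece", "pupil" (hypothetical labels)
    bbox: tuple  # normalized (x, y, w, h) -> encodes size and position

@dataclass
class SceneGraph:
    nodes: list = field(default_factory=list)
    edges: list = field(default_factory=list)  # (src_idx, relation, dst_idx)

    def add_node(self, label, bbox):
        # Returns the node index so edges can reference it.
        self.nodes.append(Node(label, bbox))
        return len(self.nodes) - 1

    def add_edge(self, src, relation, dst):
        self.edges.append((src, relation, dst))

    def to_triples(self):
        # Flatten to (subject, relation, object) triples, the form a
        # graph encoder could embed before injecting into the model.
        return [(self.nodes[s].label, r, self.nodes[d].label)
                for s, r, d in self.edges]

g = SceneGraph()
tool = g.add_node("phaco_handpiece", (0.4, 0.3, 0.1, 0.2))
eye = g.add_node("pupil", (0.35, 0.35, 0.3, 0.3))
g.add_edge(tool, "touches", eye)
print(g.to_triples())  # [('phaco_handpiece', 'touches', 'pupil')]
```

Editing such a graph between frames (resizing a bounding box, adding a node for a new tool) is what would give a user the fine-grained control the paper describes.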
Authors

Ssharvien Kumar R. Sivakumar
TU Darmstadt, Fraunhoferstr. 5, Darmstadt, 64297, Germany

Yannik Frisch
PhD Student, TU Darmstadt
Generative Models · Representation Learning · Surgical Data · Medical Imaging

Ghazal Ghazaei
Carl Zeiss AG
Deep Learning · Video Understanding · Surgical Workflow Analysis · AI in Health

Anirban Mukhopadhyay
TU Darmstadt, Fraunhoferstr. 5, Darmstadt, 64297, Germany