SWoMo: Neuro-Symbolic World Model for Cataract Surgery Simulation

📅 2026-05-15
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
Current surgical simulation methods struggle to combine visual realism, physically plausible interactions, and out-of-distribution generalization, which limits their clinical applicability. This work proposes SWoMo, a neuro-symbolic world model for cataract surgery that decouples motion generation from visual rendering: a symbolic component models tool-tissue interaction dynamics using a rule-based simulator and scene graphs, while a diffusion model synthesizes high-fidelity visual appearance. An inverse pairing strategy reconstructs real surgical videos in the simulator, yielding paired simulated and real videos that are used to train the diffusion model for sim-to-real translation. Experiments show qualitative and quantitative improvements over prior work, together with generalization to unseen interaction geometries, improved downstream phase detection, and unsupervised video style transfer.
📝 Abstract
Realistic surgical simulation plays a crucial role in training novice surgeons and in the development of autonomous agents. World models can scale such simulation environments to realistic and diverse procedures by predicting future patient states conditioned on current observations and surgical actions. However, current state-of-the-art approaches often fail to satisfy key criteria required for clinical applicability, including visual realism, physically grounded interactions, and the ability to simulate scenarios beyond the training distribution. Hence, we introduce SWoMo, a neuro-symbolic world model for cataract surgery simulation that decouples motion generation from visual realism. The symbolic component, consisting of a rule-based simulator and scene graph representations, models motion dynamics and tool-tissue interactions, while a diffusion model produces realistic visual appearance, including textures and tissue deformations. We propose an inverse pairing strategy that reconstructs real surgical videos in the simulator to obtain paired simulated and real videos, which are then used to train our video diffusion model for the reverse objective of sim-to-real translation. Our experiments show both qualitative and quantitative improvements over prior work. We demonstrate that our simulator further satisfies the key criteria, including generalisation to unseen interaction geometries, improvements in downstream phase detection, and unsupervised video style transfer. The code, data, and model weights are available at: https://ssharvienkumar.github.io/SWoMo/
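To make the decoupling described in the abstract more concrete, here is a minimal sketch of how a symbolic state could be advanced by a rule and then handed to a separate renderer. It is an illustration under stated assumptions, not the authors' implementation: the names Node, SceneGraph, step, and render_condition, the 2D disc geometry, and the single "touches" rule are all hypothetical, and the diffusion model itself is not implemented.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One scene-graph entity, e.g. a tool or an anatomical structure."""
    name: str
    x: float
    y: float
    radius: float


@dataclass
class SceneGraph:
    """Nodes plus labelled edges such as ("forceps", "touches", "lens")."""
    nodes: dict = field(default_factory=dict)
    edges: set = field(default_factory=set)


def step(graph, tool, dx, dy):
    """Hypothetical rule-based update: move one tool, then recompute contacts."""
    moved = graph.nodes[tool]
    nodes = dict(graph.nodes)
    nodes[tool] = Node(tool, moved.x + dx, moved.y + dy, moved.radius)
    t = nodes[tool]
    edges = set()
    for other in nodes.values():
        if other.name == tool:
            continue
        dist = ((t.x - other.x) ** 2 + (t.y - other.y) ** 2) ** 0.5
        # Assumed rule: the tool "touches" a structure when their discs overlap.
        if dist <= t.radius + other.radius:
            edges.add((tool, "touches", other.name))
    return SceneGraph(nodes, edges)


def render_condition(graph):
    """Flatten the symbolic state into a conditioning record for a renderer.

    In a full system a video diffusion model would consume this record;
    that model is out of scope here and is not implemented.
    """
    return {
        "positions": {n.name: (n.x, n.y) for n in graph.nodes.values()},
        "relations": sorted(graph.edges),
    }


initial = SceneGraph(nodes={
    "forceps": Node("forceps", 0.0, 0.0, 1.0),
    "lens": Node("lens", 5.0, 0.0, 2.0),
})
graph = step(initial, "forceps", 2.5, 0.0)
condition = render_condition(graph)
```

In this sketch the motion logic lives entirely in step, so new tool trajectories or geometries only change the symbolic state, while the appearance model sees nothing but the record returned by render_condition.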
Problem

Research questions and friction points this paper is trying to address.

surgical simulation
world model
visual realism
physically grounded interactions
out-of-distribution generalization
Innovation

Methods, ideas, or system contributions that make the work stand out.

neuro-symbolic
world model
surgical simulation
diffusion model
sim-to-real translation