SWoMo: Neuro-Symbolic World Model for Cataract Surgery Simulation

📅 2026-05-15

🤖 AI Summary

Current surgical simulation methods struggle to achieve visual realism, physically plausible interactions, and out-of-distribution generalization at the same time, which limits their clinical applicability. This work proposes SWoMo, a neuro-symbolic world model for cataract surgery that decouples motion generation from visual rendering: a symbolic component (a rule-based simulator plus scene graph representations) models motion dynamics and tool-tissue interactions, while a diffusion model synthesizes realistic visual appearance, including textures and tissue deformations. An inverse pairing strategy reconstructs real surgical videos in the simulator to obtain paired simulated and real videos, which are then used to train the video diffusion model for sim-to-real translation. The experiments report qualitative and quantitative improvements over prior work, along with generalization to unseen interaction geometries, improvements in downstream phase detection, and unsupervised video style transfer.

📝 Abstract

Realistic surgical simulation plays a crucial role in training novice surgeons and in the development of autonomous agents. World models can scale such simulation environments to realistic and diverse procedures by predicting future patient states conditioned on current observations and surgical actions. However, current state-of-the-art approaches often fail to satisfy key criteria required for clinical applicability, including visual realism, physically grounded interactions, and the ability to simulate scenarios beyond the training distribution. Hence, we introduce SWoMo, a neuro-symbolic world model for cataract surgery simulation that decouples motion generation from visual realism. The symbolic component, consisting of a rule-based simulator and scene graph representations, models motion dynamics and tool-tissue interactions, while a diffusion model produces realistic visual appearance, including textures and tissue deformations. We propose an inverse pairing strategy that reconstructs real surgical videos in the simulator to obtain paired simulated and real videos, which are then used to train our video diffusion model for the reverse objective of sim-to-real translation. Our experiments show both qualitative and quantitative improvements over prior work. We demonstrate that our simulator further satisfies the key criteria, including generalisation to unseen interaction geometries, improvements in downstream phase detection, and unsupervised video style transfer. The code, data, and model weights are available at: https://ssharvienkumar.github.io/SWoMo/
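
The decoupling described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `SceneGraph` class, the `symbolic_step` contact rule, the `contact_radius` parameter, and the `render_placeholder` function are all hypothetical names chosen for illustration, and the renderer is a text stand-in for where a video diffusion model would sit. The only point is that motion is advanced by explicit rules on a symbolic state, and appearance is produced by a separate step that consumes that state.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Vec2 = Tuple[float, float]


@dataclass
class SceneGraph:
    """Hypothetical symbolic state: named 2D positions plus relation triples."""
    positions: Dict[str, Vec2]
    edges: List[Tuple[str, str, str]] = field(default_factory=list)


def symbolic_step(state: SceneGraph, action: Vec2,
                  contact_radius: float = 1.0) -> SceneGraph:
    """Rule-based transition (assumed rule): translate the tool by `action`,
    then mark every other node within `contact_radius` as touched."""
    tool_x, tool_y = state.positions["tool"]
    positions = dict(state.positions)  # copy so the input state is not mutated
    positions["tool"] = (tool_x + action[0], tool_y + action[1])

    edges = []
    for name, (x, y) in positions.items():
        if name == "tool":
            continue
        distance = math.hypot(positions["tool"][0] - x, positions["tool"][1] - y)
        if distance <= contact_radius:
            edges.append(("tool", "touches", name))
    return SceneGraph(positions, edges)


def render_placeholder(state: SceneGraph) -> str:
    """Stand-in for the neural renderer: a real system would condition a
    video diffusion model on the symbolic state instead of printing it."""
    contacts = ", ".join(obj for _, _, obj in state.edges) or "none"
    return f"frame(tool={state.positions['tool']}, contacts={contacts})"


def rollout(state: SceneGraph, actions: List[Vec2]) -> List[SceneGraph]:
    """Advance the symbolic state once per action and keep every state."""
    states = [state]
    for action in actions:
        state = symbolic_step(state, action)
        states.append(state)
    return states


if __name__ == "__main__":
    start = SceneGraph({"tool": (0.0, 0.0), "lens": (3.0, 0.0)})
    for s in rollout(start, [(1.0, 0.0), (1.0, 0.0)]):
        print(render_placeholder(s))
```

Because the transition rule only looks at geometry, a tool trajectory never seen during training still yields a well-defined symbolic state, which is the intuition behind generalising to unseen interaction geometries; the rendering step never has to invent the motion itself.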

Problem

Research questions and friction points this paper is trying to address.

surgical simulation

world model

visual realism

physically grounded interactions

out-of-distribution generalization

Innovation

Methods, ideas, or system contributions that make the work stand out.

neuro-symbolic

world model

surgical simulation