Surgical Vision World Model

📅 2025-03-03
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Current surgical simulation suffers from insufficient realism, and existing world models rely on action-labeled data—unavailable in most clinical surgical videos—hindering the development of autonomous surgical agents. This paper introduces the first unsupervised world model tailored for surgical vision, capable of implicitly learning action representations directly from unlabeled raw surgical video without requiring explicit action annotations, while enabling action-controllable, high-fidelity video generation. Methodologically, we propose a Genie-inspired latent-variable framework that jointly integrates video autoregressive modeling with a latent action disentanglement structure, trained end-to-end on the unlabeled SurgToolLoc-2022 dataset. Experiments demonstrate temporally coherent, anatomically detailed video synthesis, supporting action interpolation and conditional video re-generation. Our approach establishes a novel paradigm for medical training and AI-driven surgical agent development.
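The core Genie-style idea the summary describes can be sketched in a few lines: an action encoder maps a pair of consecutive frames to the nearest entry in a small discrete latent-action codebook, and a dynamics model rolls a frame forward conditioned on a chosen action — with no action labels ever supplied. The sketch below is a minimal toy illustration, not the paper's implementation; all dimensions, function names, and the frame-difference encoder are hypothetical stand-ins for the learned networks.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes; the actual model is a large learned architecture
# with a VQ latent-action bottleneck.
FRAME_DIM = 64   # flattened frame features
N_ACTIONS = 8    # size of the discrete latent-action codebook
codebook = rng.normal(size=(N_ACTIONS, FRAME_DIM))

def infer_latent_action(frame_t, frame_t1):
    """Assign a frame transition to its nearest discrete latent action.

    Here the action encoder is stood in for by a raw frame difference;
    in the real model it is a learned network trained end-to-end."""
    delta = frame_t1 - frame_t
    dists = np.linalg.norm(codebook - delta, axis=1)
    return int(np.argmin(dists))

def predict_next_frame(frame_t, action_id):
    """Toy dynamics model: advance the frame under a chosen latent action.

    Stand-in for the autoregressive video model, which would condition
    on the full frame history rather than a single frame."""
    return frame_t + codebook[action_id]

# Round trip: the action inferred from a transition should reproduce it,
# which is what makes unlabeled video usable for action-controlled generation.
f0 = rng.normal(size=FRAME_DIM)
a = infer_latent_action(f0, f0 + codebook[3])
f1 = predict_next_frame(f0, a)
```

Swapping the chosen `action_id` at generation time is what yields action-controllable synthesis, and interpolating between codebook entries gives the action interpolation mentioned above.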

📝 Abstract
Realistic and interactive surgical simulation has the potential to facilitate crucial applications, such as medical professional training and autonomous surgical agent training. In the natural visual domain, world models have enabled action-controlled data generation, demonstrating the potential to train autonomous agents in interactive simulated environments when large-scale real data acquisition is infeasible. However, such works in the surgical domain have been limited to simplified computer simulations, and lack realism. Furthermore, existing literature in world models has predominantly dealt with action-labeled data, limiting their applicability to real-world surgical data, where obtaining action annotation is prohibitively expensive. Inspired by the recent success of Genie in leveraging unlabeled video game data to infer latent actions and enable action-controlled data generation, we propose the first surgical vision world model. The proposed model can generate action-controllable surgical data and the architecture design is verified with extensive experiments on the unlabeled SurgToolLoc-2022 dataset. Codes and implementation details are available at https://github.com/bhattarailab/Surgical-Vision-World-Model
Problem

Research questions and friction points this paper is trying to address.

Current surgical simulators lack the realism needed for medical training and agent development.
Existing world models require action-labeled data, which clinical surgical videos do not provide.
Annotating actions in real surgical data is prohibitively expensive at scale.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Learns latent action representations directly from unlabeled surgical video, inspired by Genie's use of unlabeled video game data
Generates action-controllable, temporally coherent, anatomically detailed surgical video
Verified with extensive experiments on the unlabeled SurgToolLoc-2022 dataset
Saurabh Koju
Nepal Applied Mathematics and Informatics Institute for Research (NAAMII), Nepal

Saurav Bastola
Nepal Applied Mathematics and Informatics Institute for Research (NAAMII), Nepal

Prashant Shrestha
Research Assistant, NAAMII
Machine Learning

Sanskar Amgain
University of Tennessee
Machine Learning

Y. Shrestha
University of Lausanne, Switzerland

Rudra P. K. Poudel
Cambridge Research Laboratory, Toshiba Europe Ltd, UK

Binod Bhattarai
Assistant Professor, University of Aberdeen
Machine Learning · Medical Image Analysis · Computer Vision