Cosmos-Surg-dVRK: World Foundation Model-based Automated Online Evaluation of Surgical Robot Policy Learning

📅 2025-10-17
🤖 AI Summary
Evaluating surgical robot policies on physical hardware is costly and hard to reproduce, while conventional simulation suffers from a significant sim-to-real gap. To address these problems, we propose Cosmos-Surg-dVRK, the first fully automated online evaluation framework for surgical robotics grounded in World Foundation Models (WFMs). It pairs a WFM that simulates soft-tissue deformation at high fidelity with a video classifier derived from V-JEPA 2, enabling end-to-end automatic assessment of dVRK robot policies. This makes virtual benchmarking possible for complex tasks, including tabletop suturing and ex vivo porcine cholecystectomy. Experiments show strong correlation between simulated and real-world performance (Pearson’s *r* > 0.92) and high agreement between the video classifier and human annotations (Cohen’s *κ* = 0.87). The framework substantially improves evaluation efficiency, reproducibility, and clinical credibility.
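The two statistics quoted above can be computed from paired evaluation results. The following is a minimal sketch in plain Python, not the authors' evaluation code: the function names `pearson_r` and `cohen_kappa` and all sample values are hypothetical and chosen only for illustration.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    n = len(rater_a)
    if n != len(rater_b) or n == 0:
        raise ValueError("need two equal-length, non-empty label sequences")
    labels = set(rater_a) | set(rater_b)
    # Observed agreement: fraction of items labeled identically.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if both raters labeled independently at their own rates.
    p_expected = sum(
        (rater_a.count(label) / n) * (rater_b.count(label) / n) for label in labels
    )
    if p_expected == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical per-policy success rates: simulated rollouts vs. real robot.
sim_success = [0.10, 0.50, 0.90]
real_success = [0.20, 0.60, 1.00]
print(pearson_r(sim_success, real_success))

# Hypothetical per-video success labels: human annotator vs. video classifier.
human = [1, 1, 0, 0]
classifier = [1, 1, 0, 1]
print(cohen_kappa(human, classifier))
```

Pearson's *r* here compares per-policy success rates across the two evaluation settings, while Cohen's *κ* compares per-video labels from two raters, so the two metrics answer different questions about the pipeline.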

📝 Abstract
The rise of surgical robots and vision-language-action models has accelerated the development of autonomous surgical policies and efficient assessment strategies. However, evaluating these policies directly on physical robotic platforms such as the da Vinci Research Kit (dVRK) remains hindered by high costs, time demands, reproducibility challenges, and variability in execution. World foundation models (WFMs) for physical AI offer a transformative approach to simulating complex real-world surgical tasks, including phenomena such as soft tissue deformation, with high fidelity. This work introduces Cosmos-Surg-dVRK, a surgical fine-tune of the Cosmos WFM, which, together with a trained video classifier, enables fully automated online evaluation and benchmarking of surgical policies. We evaluate Cosmos-Surg-dVRK using two distinct surgical datasets. On tabletop suture pad tasks, the automated pipeline achieves strong correlation between online rollouts in Cosmos-Surg-dVRK and policy outcomes on the real dVRK Si platform, as well as good agreement between human labelers and the V-JEPA 2-derived video classifier. Additionally, preliminary experiments with ex vivo porcine cholecystectomy tasks in Cosmos-Surg-dVRK demonstrate promising alignment with real-world evaluations, highlighting the platform's potential for more complex surgical procedures.
Problem

Research questions and friction points this paper is trying to address.

Automating surgical policy evaluation using world foundation models
Reducing physical robot testing costs through simulation fidelity
Enabling reproducible benchmarking for autonomous surgical robot policies
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cosmos-Surg-dVRK enables automated surgical policy evaluation
World foundation model simulates complex surgical tissue deformation
Video classifier provides automated assessment of surgical outcomes
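
The three contributions above fit together as a closed loop: the policy proposes an action, the world model predicts the next frame, and the classifier scores the finished rollout. The sketch below illustrates only that control flow under strong simplifying assumptions. A "frame" is a single float standing in for an image, and `rollout`, `evaluate`, `toy_policy`, `toy_world_model`, and `toy_classifier` are hypothetical names, not the interfaces of Cosmos-Surg-dVRK or V-JEPA 2.

```python
from typing import Callable, List, Sequence

Frame = float   # stand-in for a camera image (assumption for illustration)
Action = float  # stand-in for a robot command (assumption for illustration)


def rollout(
    policy: Callable[[Frame], Action],
    world_model: Callable[[Frame, Action], Frame],
    first_frame: Frame,
    horizon: int,
) -> List[Frame]:
    """Run the policy inside the world model and return the generated frames."""
    frames = [first_frame]
    for _ in range(horizon):
        action = policy(frames[-1])                      # policy sees the latest frame
        frames.append(world_model(frames[-1], action))   # model predicts the next one
    return frames


def evaluate(
    policy: Callable[[Frame], Action],
    world_model: Callable[[Frame, Action], Frame],
    classifier: Callable[[Sequence[Frame]], bool],
    initial_frames: Sequence[Frame],
    horizon: int,
) -> float:
    """Return the fraction of rollouts that the classifier labels as successful."""
    outcomes = [
        classifier(rollout(policy, world_model, frame, horizon))
        for frame in initial_frames
    ]
    return sum(outcomes) / len(outcomes)


# Toy stand-ins: the frame is a "distance to target" that the policy tries to shrink.
def toy_policy(frame: Frame) -> Action:
    return 0.5 * frame            # move halfway toward the target


def toy_world_model(frame: Frame, action: Action) -> Frame:
    return frame - action         # predicted next distance


def toy_classifier(frames: Sequence[Frame]) -> bool:
    return abs(frames[-1]) < 0.1  # success if the final distance is small


success_rate = evaluate(
    toy_policy, toy_world_model, toy_classifier, initial_frames=[1.0, 8.0], horizon=5
)
print(success_rate)
```

Because no physical robot appears in the loop, the same initial frames can be replayed for every policy under test, which is where the reproducibility benefit comes from.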