Multimodal Optimal Transport for Unsupervised Temporal Segmentation in Surgical Robotics

📅 2026-02-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the scarcity of annotated data in surgical robot videos by proposing TASOT, a method that, for the first time, jointly leverages intrinsic textual semantics and visual features for unsupervised temporal action segmentation. Formulating the task as a multimodal optimal transport problem, TASOT aligns video frames with surgical actions through unbalanced Gromov–Wasserstein regularization, without requiring surgery-specific pretraining or external supervision. On multiple surgical benchmarks, including Cholec80 and StrasBypass70, TASOT substantially outperforms existing zero-shot approaches, improving segmentation performance by 16.5 and 23.7 points, respectively.

📝 Abstract
Recognizing surgical phases and steps from video is a fundamental problem in computer-assisted interventions. Recent approaches increasingly rely on large-scale pre-training on thousands of labeled surgical videos, followed by zero-shot transfer to specific procedures. While effective, this strategy incurs substantial computational and data collection costs. In this work, we question whether such heavy pre-training is truly necessary. We propose Text-Augmented Action Segmentation Optimal Transport (TASOT), an unsupervised method for surgical phase and step recognition that extends Action Segmentation Optimal Transport (ASOT) by incorporating textual information generated directly from the videos. TASOT formulates temporal action segmentation as a multimodal optimal transport problem, where the matching cost is defined as a weighted combination of visual and text-based costs. The visual term captures frame-level appearance similarity, while the text term provides complementary semantic cues, and both are jointly regularized through a temporally consistent unbalanced Gromov-Wasserstein formulation. This design enables effective alignment between video frames and surgical actions without surgical-specific pretraining or external web-scale supervision. We evaluate TASOT on multiple benchmark surgical datasets and observe consistent and substantial improvements over existing zero-shot methods, including StrasBypass70 (+23.7), BernBypass70 (+4.5), Cholec80 (+16.5), and AutoLaparo (+19.6). These results demonstrate that fine-grained surgical understanding can be achieved by exploiting information already present in standard visual and textual representations, without resorting to increasingly complex pre-training pipelines. The code will be available at https://github.com/omar8ahmed9/TASOT.
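To make the core idea concrete, here is a minimal sketch of the multimodal matching cost described in the abstract: a weighted combination of a visual and a text-based cost, aligned to actions with plain entropic optimal transport (Sinkhorn). This is not the authors' implementation — TASOT additionally uses a temporally consistent unbalanced Gromov–Wasserstein regularizer — and all feature shapes, names, and the weight `alpha` below are hypothetical placeholders.

```python
import numpy as np

def sinkhorn(C, reg=0.1, n_iter=200):
    """Entropic-regularized OT between uniform marginals (minimal sketch)."""
    n, m = C.shape
    a, b = np.ones(n) / n, np.ones(m) / m   # uniform marginals
    K = np.exp(-C / reg)                    # Gibbs kernel
    v = np.ones(m)
    for _ in range(n_iter):                 # alternating scaling updates
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]      # transport plan

def cosine_cost(X, Y):
    """1 - cosine similarity between rows of X and rows of Y."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    return 1.0 - Xn @ Yn.T

# Hypothetical features: T frames, A candidate actions, d-dim embeddings.
T, A, d = 120, 7, 32
rng = np.random.default_rng(0)
frame_vis = rng.normal(size=(T, d))    # per-frame visual features
frame_txt = rng.normal(size=(T, d))    # per-frame text features (e.g. captions)
action_vis = rng.normal(size=(A, d))   # per-action visual prototypes
action_txt = rng.normal(size=(A, d))   # per-action text embeddings

# Weighted combination of visual and text-based costs.
alpha = 0.5
C = alpha * cosine_cost(frame_vis, action_vis) \
    + (1 - alpha) * cosine_cost(frame_txt, action_txt)

P = sinkhorn(C)               # soft frame-to-action alignment
labels = P.argmax(axis=1)     # hard per-frame action assignment
```

In the paper's full formulation the balanced Sinkhorn solver above would be replaced by an unbalanced Gromov–Wasserstein problem with a temporal-consistency structure cost, so that segment boundaries respect the ordering of surgical phases rather than treating frames independently.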
Problem

Research questions and friction points this paper is trying to address.

Unsupervised Temporal Segmentation
Surgical Phase Recognition
Multimodal Optimal Transport
Action Segmentation
Surgical Robotics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal Optimal Transport
Unsupervised Temporal Segmentation
Text-Augmented Action Recognition
Gromov-Wasserstein Alignment
Surgical Phase Recognition