Slot-BERT: Self-supervised Object Discovery in Surgical Video

📅 2025-01-21
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
To address unsupervised object discovery in surgical videos, this paper proposes Slot-BERT, a bidirectional Transformer-based model that learns object-centric, temporally consistent slot representations in latent space, avoiding both recurrent (RNN) processing and full-video parallel modeling. Its key innovation is a slot contrastive loss that enforces inter-slot orthogonality, suppressing redundancy and promoting representation disentanglement. The architecture processes videos of arbitrary length seamlessly and supports zero-shot cross-specialty domain adaptation. Evaluated on real-world laparoscopic cholecystectomy and thoracic surgery videos, Slot-BERT significantly outperforms existing unsupervised object-centric methods. Moreover, it demonstrates robust zero-shot transfer across diverse datasets and surgical procedures, establishing new state-of-the-art performance in unsupervised surgical video understanding.
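The slot contrastive loss summarized above can be sketched as a penalty on pairwise similarity between slot vectors: if slots are mutually orthogonal, the penalty vanishes. The sketch below is a simplified illustration of that orthogonality idea (cosine similarity, mean squared off-diagonal), not the authors' exact formulation.

```python
import numpy as np

def slot_contrastive_loss(slots: np.ndarray) -> float:
    """Penalize pairwise similarity between slots to encourage orthogonality.

    slots: array of shape (num_slots, slot_dim).
    Hypothetical simplification of a slot contrastive objective:
    returns the mean squared off-diagonal cosine similarity.
    """
    # L2-normalize each slot vector
    norms = np.linalg.norm(slots, axis=1, keepdims=True)
    z = slots / np.clip(norms, 1e-8, None)
    # Cosine similarity between all slot pairs
    sim = z @ z.T
    k = slots.shape[0]
    # Zero out self-similarity on the diagonal, average the rest
    off_diag = sim - np.eye(k)
    return float(np.sum(off_diag ** 2) / (k * (k - 1)))
```

Orthogonal slots (e.g. an identity matrix) give a loss of 0, while identical slots give a loss of 1, so minimizing this term pushes slots apart in representation space.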

πŸ“ Abstract
Object-centric slot attention is a powerful framework for unsupervised learning of structured and explainable representations that can support reasoning about objects and actions, including in surgical videos. While conventional object-centric methods for video leverage recurrent processing to achieve efficiency, they often struggle to maintain the long-range temporal coherence required for long videos in surgical applications. Fully parallel processing of entire videos, on the other hand, enhances temporal consistency but introduces significant computational overhead, making it impractical to deploy on hardware in medical facilities. We present Slot-BERT, a bidirectional long-range model that learns object-centric representations in a latent space while ensuring robust temporal coherence. Slot-BERT scales object discovery seamlessly to long videos of unconstrained length. A novel slot contrastive loss further reduces redundancy and improves representation disentanglement by enhancing slot orthogonality. We evaluate Slot-BERT on real-world surgical video datasets from abdominal, cholecystectomy, and thoracic procedures. Our method surpasses state-of-the-art object-centric approaches under unsupervised training, achieving superior performance across diverse domains. We also demonstrate efficient zero-shot domain adaptation to data from diverse surgical specialties and databases.
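For intuition on the slot attention framework the abstract builds on, here is a minimal numpy sketch of a single slot-attention iteration in the style of the generic framework (Locatello et al.), not Slot-BERT's architecture; learned projections, layer norms, and the GRU update are omitted, so all names here are illustrative.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int) -> np.ndarray:
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def slot_attention_step(slots: np.ndarray, inputs: np.ndarray) -> np.ndarray:
    """One simplified slot-attention iteration.

    slots:  (num_slots, dim) current slot estimates
    inputs: (num_inputs, dim) per-pixel or per-patch features
    Returns updated slots of shape (num_slots, dim).
    """
    d = slots.shape[1]
    # Attention logits between every input feature and every slot
    logits = inputs @ slots.T / np.sqrt(d)     # (num_inputs, num_slots)
    # Softmax over slots: inputs compete to be explained by slots
    attn = softmax(logits, axis=1)
    # Normalize over inputs so each slot takes a weighted mean
    attn = attn / (attn.sum(axis=0, keepdims=True) + 1e-8)
    return attn.T @ inputs                     # (num_slots, dim)
```

The softmax over the slot axis is what makes slots compete for input features, which is the mechanism that yields object-centric decompositions; recurrent or parallel temporal models then differ in how these per-frame slots are linked over time.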
Problem

Research questions and friction points this paper is trying to address.

Automatic Object Recognition
Long Video Processing
Adaptation to Unseen Surgical Procedures
Innovation

Methods, ideas, or system contributions that make the work stand out.

Slot-BERT
surgical video analysis
unsupervised learning
Guiqiu Liao
University of Pennsylvania
Surgical robotics · Computer vision · Machine learning
Matjaz Jogan
Penn Computer Assisted Surgery and Outcomes Laboratory, Department of Surgery, University of Pennsylvania, Philadelphia, PA, USA
Marcel Hussing
Department of Computer and Information Science, University of Pennsylvania, Philadelphia, PA, USA
Kenta Nakahashi
Division of Thoracic Surgery, Toronto General Hospital, University Health Network, Toronto, Ontario, Canada
Kazuhiro Yasufuku
Division of Thoracic Surgery, Toronto General Hospital, University Health Network, Toronto, Ontario, Canada
Amin Madani
Surgical Artificial Intelligence Research Academy, University Health Network, Toronto, ON, Canada
Eric Eaton
University of Pennsylvania
artificial intelligence · machine learning · continual learning · robotics · medicine
Daniel A. Hashimoto
Penn Computer Assisted Surgery and Outcomes Laboratory, Department of Surgery, University of Pennsylvania, Philadelphia, PA, USA; Department of Computer and Information Science, University of Pennsylvania, Philadelphia, PA, USA