Future Slot Prediction for Unsupervised Object Discovery in Surgical Video

📅 2025-07-02
🤖 AI Summary
Existing unsupervised object discovery methods struggle on surgical video: their temporal consistency is poor and they parse dynamic objects weakly, particularly when the number of slots must be estimated adaptively. To address this, we propose the dynamic temporal slot transformer (DTST), the first framework to incorporate future slot initialization prediction into unsupervised object-centric learning. The method builds object-centric representations with slot attention and models inter-frame dynamics with a temporal Transformer, so that adaptive slot-count adjustment and future-state prediction are learned jointly. Evaluated on multiple public surgical video datasets, DTST achieves state-of-the-art object segmentation accuracy and trajectory consistency. Ablation studies indicate that the future slot prediction mechanism improves temporal modeling on medical video. The results suggest that unsupervised object discovery can be applied to real-world clinical video.

📝 Abstract
Object-centric slot attention is an emerging paradigm for unsupervised learning of structured, interpretable object-centric representations (slots). This enables effective reasoning about objects and events at a low computational cost and is thus applicable to critical healthcare applications, such as real-time interpretation of surgical video. The heterogeneous scenes in real-world applications like surgery are, however, difficult to parse into a meaningful set of slots. Current approaches with an adaptive slot count perform well on images, but their performance on surgical videos is low. To address this challenge, we propose a dynamic temporal slot transformer (DTST) module that is trained both for temporal reasoning and for predicting the optimal future slot initialization. The model achieves state-of-the-art performance on multiple surgical databases, demonstrating that unsupervised object-centric methods can be applied to real-world data and become part of the common arsenal in healthcare applications.
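
The slot-attention idea the abstract builds on can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes a stripped-down variant with no learned projections, no GRU update, and no MLP, so each iteration reduces to a soft clustering step. The function names (`slot_attention_step`, `slot_attention`) are hypothetical. What it does show is the defining mechanism: the softmax is taken over slots, so slots compete for each input feature.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def slot_attention_step(slots, features):
    """One simplified slot-attention iteration (assumption: no learned weights).

    slots:    K vectors of dimension D
    features: N vectors of dimension D
    Returns (new_slots, attn), where attn[n][k] is the share of feature n
    assigned to slot k.
    """
    dim = len(slots[0])
    scale = 1.0 / math.sqrt(dim)
    # Softmax over slots for each feature: slots compete for inputs.
    attn = [softmax([scale * dot(f, s) for s in slots]) for f in features]
    new_slots = []
    for k in range(len(slots)):
        weights = [row[k] for row in attn]
        total = sum(weights) + 1e-8
        # Each slot becomes the attention-weighted mean of the features.
        new_slots.append([
            sum(w * f[d] for w, f in zip(weights, features)) / total
            for d in range(dim)
        ])
    return new_slots, attn


def slot_attention(features, init_slots, iters=3):
    """Run a few refinement iterations starting from init_slots."""
    slots, attn = init_slots, None
    for _ in range(iters):
        slots, attn = slot_attention_step(slots, features)
    return slots, attn
```

The result depends on `init_slots`, which is why the quality of the slot initialization matters for video: a poor starting point for a frame leaves the iterations little to refine.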

Problem

Research questions and friction points this paper is trying to address.

Predict future slots for unsupervised surgical video analysis
Improve object discovery in heterogeneous surgical scenes
Enhance temporal reasoning for dynamic slot initialization

Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic temporal slot transformer (DTST) module trained for temporal reasoning
Predicts the optimal future slot initialization
Achieves state-of-the-art performance on multiple surgical video datasets
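
To make "future slot initialization" concrete, here is a small sketch of the general idea: use the slots from past frames to propose the starting slots for the next frame. It is an illustration under strong assumptions and not the DTST architecture, whose details are not given on this page. It assumes a fixed slot count, that slot k in one frame corresponds to slot k in the next, and untrained dot-product attention over each slot's own history. The name `predict_next_slot_init` is hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_next_slot_init(slot_history):
    """Propose slot initializations for the next frame.

    slot_history: list over T frames; each frame holds K slot vectors of
                  dimension D (assumption: K is fixed and slot order is
                  consistent across frames).
    Returns K vectors of dimension D.
    """
    latest = slot_history[-1]
    scale = 1.0 / math.sqrt(len(latest[0]))
    predicted = []
    for k, query in enumerate(latest):
        # History of slot k across all observed frames.
        past = [frame[k] for frame in slot_history]
        # Temporal attention: the latest slot state queries its own past.
        scores = [
            scale * sum(q * p for q, p in zip(query, vec)) for vec in past
        ]
        weights = softmax(scores)
        # Attention-weighted mean of past states (no learned weights here).
        predicted.append([
            sum(w * vec[d] for w, vec in zip(weights, past))
            for d in range(len(query))
        ])
    return predicted
```

In a trained model the attention weights and output would come from learned layers, and the slot count could change between frames. Here the output is only a similarity-weighted average of past slot states, so it can be passed to a slot-attention routine as the next frame's initialization but cannot extrapolate motion.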

Guiqiu Liao
University of Pennsylvania
Surgical robotics, Computer vision, Machine learning
Matjaz Jogan
PCASO Laboratory, Department of Surgery, University of Pennsylvania
Marcel Hussing
Department of Computer and Information Science, University of Pennsylvania
Edward Zhang
Student in ECE, Carnegie Mellon University
Machine Learning
Eric Eaton
University of Pennsylvania
artificial intelligence, machine learning, continual learning, robotics, medicine
Daniel A. Hashimoto
PCASO Laboratory, Department of Surgery, University of Pennsylvania