🤖 AI Summary
Existing unsupervised object discovery methods for surgical videos suffer from poor temporal consistency and weak parsing of dynamic objects, particularly when the number of slots must be estimated adaptively. To address this, we propose the Dynamic Temporal Slot Transformer (DTST), the first framework to incorporate *future slot initialization prediction* into unsupervised object-centric learning. Our method constructs object-centric representations via slot attention and models inter-frame dynamics with a temporal Transformer, which lets it jointly adapt the slot count and predict future slot states. Evaluated on multiple public surgical video datasets, DTST achieves state-of-the-art performance in object segmentation accuracy and trajectory consistency. Ablation studies confirm that the future slot prediction mechanism significantly improves temporal modeling for medical video analysis. This work advances the applicability of unsupervised object discovery toward real-world clinical scenarios.
📝 Abstract
Slot attention is an emerging paradigm for unsupervised learning of structured, interpretable object-centric representations (slots). It enables effective reasoning about objects and events at a low computational cost and is thus applicable to critical healthcare applications, such as real-time interpretation of surgical video. The heterogeneous scenes in real-world applications like surgery are, however, difficult to parse into a meaningful set of slots. Current approaches with an adaptive slot count perform well on images but poorly on surgical videos. To address this challenge, we propose a dynamic temporal slot transformer (DTST) module that is trained both for temporal reasoning and for predicting the optimal future slot initialization. The model achieves state-of-the-art performance on multiple surgical datasets, demonstrating that unsupervised object-centric methods can be applied to real-world data and become part of the common arsenal in healthcare applications.
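
To make the idea of slots and slot initialization concrete, the following is a minimal NumPy sketch of a simplified slot-attention update applied frame by frame. It is not the DTST implementation: the learned projections, GRU update, and layer normalization of standard slot attention are omitted, and `predict_next_init` is a hypothetical placeholder standing in for the learned module that predicts the next frame's slot initialization (here it defaults to simply carrying the slots over).

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def slot_attention(features, slots, n_iters=3, eps=1e-8):
    """Simplified slot attention (no learned weights, no GRU).

    features: (N, D) array of per-location frame features.
    slots:    (K, D) array of initial slot vectors.
    Returns the refined slots (K, D) and the last attention map (K, N).
    """
    d = features.shape[1]
    for _ in range(n_iters):
        logits = slots @ features.T / np.sqrt(d)      # (K, N)
        # Softmax over the slot axis: slots compete for each feature.
        attn = softmax(logits, axis=0)
        # Each slot becomes the attention-weighted mean of the features.
        weights = attn / (attn.sum(axis=1, keepdims=True) + eps)
        slots = weights @ features                    # (K, D)
    return slots, attn

def run_video(frames, init_slots, predict_next_init=lambda s: s):
    """Process frames in order, seeding each frame with predicted slots.

    predict_next_init is a hypothetical stand-in for a learned temporal
    predictor; the default (identity) reuses the current slots.
    """
    slots = init_slots
    outputs = []
    for feats in frames:
        slots, attn = slot_attention(feats, slots)
        outputs.append((slots, attn))
        slots = predict_next_init(slots)
    return outputs

rng = np.random.default_rng(0)
frames = [rng.normal(size=(16, 8)) for _ in range(3)]  # 3 toy frames
init_slots = rng.normal(size=(4, 8))                   # K = 4 slots
results = run_video(frames, init_slots)
```

In this sketch the slot count `K` is fixed; an adaptive-slot-count method would additionally decide how many slots to keep per frame, which is left out here.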