🤖 AI Summary
Existing slot-based video models excel at object segmentation and tracking but lack an explicit, physics-inspired reasoning module, which limits their perceptual and predictive performance in complex scenes. To address this, the authors propose the Slot-based Time-Space Transformer with Memory buffer (STATM), a reasoning module that combines a memory buffer, which stores slot information from upstream modules, with slot-based spatiotemporal attention and fusion for prediction. Plugged into several state-of-the-art object-centric video models, STATM significantly improves their perception capabilities across multiple datasets, and as a predictive model it also performs well on downstream prediction and video-based Visual Question Answering (VQA) tasks. Code and data are publicly released.
📝 Abstract
Object-centric learning aims to break down complex visual scenes into more manageable object representations, enhancing the understanding and reasoning abilities of machine learning systems toward the physical world. Recently, slot-based video models have demonstrated remarkable proficiency in segmenting and tracking objects, but they overlook the importance of an effective reasoning module. In the real world, reasoning and predictive abilities play a crucial role in human perception and object tracking; in particular, these abilities are closely related to human intuitive physics. Inspired by this, we designed a novel reasoning module called the Slot-based Time-Space Transformer with Memory buffer (STATM) to enhance the model's perception ability in complex scenes. The memory buffer primarily serves as storage for slot information from upstream modules, while the Slot-based Time-Space Transformer makes predictions through slot-based spatiotemporal attention computations and fusion. Our experimental results on various datasets indicate that the STATM module can significantly enhance the capabilities of multiple state-of-the-art object-centric learning models for video. Moreover, as a predictive model, the STATM module also performs well in downstream prediction and Visual Question Answering (VQA) tasks. We will release our code and data at https://github.com/intell-sci-comput/STATM.
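The abstract describes two interacting pieces: a memory buffer that stores slot representations from upstream frames, and a transformer that fuses temporal attention (each slot attending to its own history) with spatial attention (slots attending to one another within the current frame). The following is a minimal NumPy sketch of that idea, not the paper's implementation; the class name `STATMSketch`, the fixed-size `deque` buffer, the single-head dot-product attention, and the simple averaging fusion are all illustrative assumptions.

```python
import numpy as np
from collections import deque


def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    # Single-head scaled dot-product attention over 2-D arrays.
    d = q.shape[-1]
    weights = softmax(q @ k.T / np.sqrt(d), axis=-1)
    return weights @ v


class STATMSketch:
    """Hypothetical sketch of a memory buffer + time/space attention fusion."""

    def __init__(self, buffer_size=8):
        # Memory buffer: a bounded queue of per-frame slot arrays.
        self.memory = deque(maxlen=buffer_size)

    def step(self, slots):
        # slots: (num_slots, dim) produced by an upstream slot encoder.
        self.memory.append(slots)
        history = np.stack(self.memory)  # (T, num_slots, dim)

        # Temporal attention: slot i queries its own buffered history.
        temporal = np.stack([
            attention(slots[i:i + 1], history[:, i], history[:, i])[0]
            for i in range(slots.shape[0])
        ])

        # Spatial attention: slots attend to each other in the current frame.
        spatial = attention(slots, slots, slots)

        # Fusion (a plain average here; the paper's fusion may differ).
        return 0.5 * (temporal + spatial)
```

In a full model, the fused output would feed a decoder for future-frame prediction or a downstream VQA head; the buffer lets the module carry object state across frames rather than reasoning from a single frame.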