Reasoning-Enhanced Object-Centric Learning for Videos

📅 2024-03-22
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
📄 PDF
🤖 AI Summary
Existing slot-based video models excel at object segmentation and tracking but lack an explicit, physics-inspired reasoning module, limiting their perceptual and predictive performance in complex scenes. To address this, the authors propose the Slot-based Time-Space Transformer with Memory buffer (STATM), a reasoning module for object-centric video models. The memory buffer stores slot information from upstream modules, while the Slot-based Time-Space Transformer makes predictions through slot-based spatiotemporal attention computation and fusion, enabling persistent cross-frame object state modeling. Experiments on multiple video datasets show that STATM significantly enhances several state-of-the-art object-centric video models, and as a predictive model it also performs well on downstream prediction and video-based Visual Question Answering (VQA) tasks. Code and data are publicly released.

📝 Abstract
Object-centric learning aims to break down complex visual scenes into more manageable object representations, enhancing the understanding and reasoning abilities of machine learning systems toward the physical world. Recently, slot-based video models have demonstrated remarkable proficiency in segmenting and tracking objects, but they overlook the importance of an effective reasoning module. In the real world, reasoning and predictive abilities play a crucial role in human perception and object tracking; in particular, these abilities are closely related to human intuitive physics. Inspired by this, we designed a novel reasoning module called the Slot-based Time-Space Transformer with Memory buffer (STATM) to enhance the model's perception ability in complex scenes. The memory buffer primarily serves as storage for slot information from upstream modules, while the Slot-based Time-Space Transformer makes predictions through slot-based spatiotemporal attention computation and fusion. Our experimental results on various datasets indicate that the STATM module can significantly enhance the capabilities of multiple state-of-the-art object-centric learning models for video. Moreover, as a predictive model, the STATM module also performs well in downstream prediction and Visual Question Answering (VQA) tasks. We will release our codes and data at https://github.com/intell-sci-comput/STATM.
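The mechanism the abstract describes, a memory buffer holding per-frame slot states that the current slots query via temporal attention before fusion, can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the class and function names, the fixed 0.5/0.5 fusion weights, and the assumption that slot index is preserved across frames are all hypothetical simplifications.

```python
import math
from collections import deque

def attend(query, keys, values):
    """Scaled dot-product attention for one query vector over buffered keys.
    query: list[float]; keys/values: list of same-length vectors."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    m = max(scores)  # subtract max for numerical stability in softmax
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    weights = [w / z for w in weights]
    # Weighted sum of the buffered value vectors
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(d)]

class STATMSketch:
    """Hypothetical sketch: memory buffer + slot-based temporal attention."""
    def __init__(self, buffer_size=8):
        # Bounded buffer of past frames' slot sets (oldest frames dropped)
        self.memory = deque(maxlen=buffer_size)

    def step(self, slots):
        """slots: list of slot vectors for the current frame."""
        if self.memory:
            history = list(self.memory)
            updated = []
            for i, slot in enumerate(slots):
                # Each slot attends to its own history across buffered frames
                past = [frame[i] for frame in history]
                ctx = attend(slot, past, past)
                # Fuse current slot with its temporal context (fixed mix here;
                # a learned fusion would replace these weights)
                updated.append([0.5 * s + 0.5 * c for s, c in zip(slot, ctx)])
        else:
            updated = slots  # first frame: nothing to attend to yet
        self.memory.append(updated)
        return updated
```

A real model would use learned projections and multi-head attention (e.g. in PyTorch) and also attend across slots spatially; the sketch only shows the buffering and temporal-attention flow.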
Problem

Research questions and friction points this paper is trying to address.

Enhance object-centric learning in videos
Integrate reasoning for better object tracking
Improve predictive abilities in complex scenes
Innovation

Methods, ideas, or system contributions that make the work stand out.

Slot-based Time-Space Transformer
Memory buffer for slot storage
Spatiotemporal attention computation and fusion
Jian Li
Renmin University of China
Pu Ren
Lawrence Berkeley National Lab, Northeastern University
Machine Learning · AI for Science
Yang Liu
University of Chinese Academy of Sciences
Hao Sun
Renmin University of China