From Vicious to Virtuous Cycles: Synergistic Representation Learning for Unsupervised Video Object-Centric Learning

📅 2026-02-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the representational inconsistency between the encoder and decoder in unsupervised video object-centric learning, where reconstruction-based training traps the two components in a vicious cycle of blurry reconstructions and noisy features. To break this cycle, the authors propose a Synergistic Representation Learning (SRL) framework that establishes, for the first time, a mutually reinforcing mechanism between the two components: the encoder's sharp attention maps deblur the decoder's semantic boundaries, while the decoder's spatial consistency refines and denoises the encoder's features. Combined with a slot-regularization warm-up strategy and a high-frequency attention–spatial consistency fusion technique, SRL mitigates the representational conflict, markedly improving the clarity and coherence of object segmentation and reconstruction, and achieves state-of-the-art performance across multiple benchmarks.

📝 Abstract
Unsupervised object-centric learning models, particularly slot-based architectures, have shown great promise in decomposing complex scenes. However, their reliance on reconstruction-based training creates a fundamental conflict between the sharp, high-frequency attention maps of the encoder and the spatially consistent but blurry reconstruction maps of the decoder. We identify that this discrepancy gives rise to a vicious cycle: the encoder's noisy feature map forces the decoder to average over possibilities and produce even blurrier outputs, while the gradient computed from the blurry reconstruction maps lacks the high-frequency detail needed to supervise the encoder's features. To break this cycle, we introduce Synergistic Representation Learning (SRL), which establishes a virtuous cycle in which the encoder and decoder mutually refine one another: SRL leverages the encoder's sharpness to deblur semantic boundaries in the decoder's output, while exploiting the decoder's spatial consistency to denoise the encoder's features. This mutual refinement is stabilized by a warm-up phase with a slot-regularization objective that initially allocates a distinct entity to each slot. By bridging the representational gap between the encoder and decoder, SRL achieves state-of-the-art results on video object-centric learning benchmarks. Code is available at https://github.com/hynnsk/SRL.
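The abstract's two refinement directions can be caricatured in a few lines. The following is a minimal 1-D sketch, not the paper's implementation: the function names, the local-mean "blur", and the blending weights `alpha`/`beta` are all assumptions made purely for illustration of the encoder-sharpens-decoder and decoder-denoises-encoder idea.

```python
# Toy 1-D sketch of SRL's mutual refinement. The actual model operates on
# 2-D feature maps inside a slot-based video architecture; everything below
# is an illustrative stand-in.

def local_mean(xs, radius=1):
    """Sliding-window average: a stand-in for spatial smoothing/blur."""
    out = []
    for i in range(len(xs)):
        lo, hi = max(0, i - radius), min(len(xs), i + radius + 1)
        out.append(sum(xs[lo:hi]) / (hi - lo))
    return out

def sharpen_decoder(decoder_map, encoder_attn, alpha=0.5):
    """Encoder -> decoder: inject the encoder attention's high-frequency
    residual to sharpen boundaries in the blurry decoder map."""
    hf = [a - m for a, m in zip(encoder_attn, local_mean(encoder_attn))]
    return [d + alpha * h for d, h in zip(decoder_map, hf)]

def denoise_encoder(encoder_attn, decoder_map, beta=0.5):
    """Decoder -> encoder: pull noisy encoder values toward the decoder's
    spatially consistent map."""
    return [(1 - beta) * a + beta * d for a, d in zip(encoder_attn, decoder_map)]

# Sharp but noisy encoder attention vs. smooth but blurry decoder map.
encoder_attn = [0.1, 0.0, 0.9, 1.0, 0.2, 0.0]
decoder_map  = [0.1, 0.3, 0.6, 0.6, 0.3, 0.1]

sharper = sharpen_decoder(decoder_map, encoder_attn)   # steeper object boundary
cleaner = denoise_encoder(encoder_attn, decoder_map)   # closer to the smooth map
```

In this toy setup, the decoder map's boundary contrast increases after fusion with the encoder's high-frequency residual, while every denoised encoder value moves toward the decoder's spatially consistent map, mirroring the virtuous cycle the abstract describes.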
Problem

Research questions and friction points this paper is trying to address.

unsupervised object-centric learning
slot-based architectures
reconstruction-based training
encoder-decoder discrepancy
video object-centric learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Synergistic Representation Learning
slot-based architectures
unsupervised object-centric learning
encoder-decoder alignment
video object segmentation