Reconstruction-Guided Slot Curriculum: Addressing Object Over-Fragmentation in Video Object-Centric Learning

📅 2026-03-23

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing slot-attention models for video object-centric learning often suffer from over-fragmentation: a single object is redundantly represented by multiple slots, because the reconstruction objective implicitly encourages the model to occupy every available slot. To address this, the work proposes SlotCurri, a reconstruction-guided slot curriculum in which training starts with a few coarse slots and new slots are allocated where reconstruction error remains high. It also augments MSE with a structure-aware loss that preserves local contrast and edge information to sharpen semantic boundaries, and uses cyclic inference, which rolls slots forward and then backward through the frame sequence, to improve temporal consistency. On YouTube-VIS and MOVi-C, SlotCurri reports FG-ARI gains of +6.8 and +8.3, respectively.
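The allocation idea can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the mask-weighted error score, and the choice to initialise new slots as noisy copies of their parents are all assumptions made for illustration, since the summary does not specify these details.

```python
import random


def per_slot_error(masks, error_map):
    """Mask-weighted mean reconstruction error for each slot.

    masks: list of K lists, each holding N per-pixel assignment weights.
    error_map: list of N per-pixel reconstruction errors.
    """
    errors = []
    for mask in masks:
        weight = sum(mask)
        weighted = sum(m * e for m, e in zip(mask, error_map))
        errors.append(weighted / weight if weight > 0 else 0.0)
    return errors


def expand_slots(slots, masks, error_map, num_new, noise_std=0.01, seed=0):
    """Add num_new slots next to the slots with the highest error.

    Assumption: each new slot is a slightly perturbed copy of its parent.
    Returns the expanded slot list and the indices of the parents chosen.
    """
    rng = random.Random(seed)
    errors = per_slot_error(masks, error_map)
    ranked = sorted(range(len(slots)), key=lambda k: errors[k], reverse=True)
    chosen = ranked[:num_new]
    new_slots = [
        [v + rng.gauss(0.0, noise_std) for v in slots[k]] for k in chosen
    ]
    return slots + new_slots, chosen


# Toy example: two slots over four pixels; slot 1 covers the poorly
# reconstructed pixels, so it is the one selected for expansion.
slots = [[0.0, 1.0], [1.0, 0.0]]
masks = [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]
error_map = [0.1, 0.1, 0.9, 0.7]
expanded, parents = expand_slots(slots, masks, error_map, num_new=1)
```

In this toy setup, capacity grows only around the region that is reconstructed poorly, which is the behaviour the summary describes; a real model would operate on learned slot vectors and attention masks rather than hand-written lists.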

📝 Abstract

Video Object-Centric Learning seeks to decompose raw videos into a small set of object slots, but existing slot-attention models often suffer from severe over-fragmentation. This is because the model is implicitly encouraged to occupy all slots to minimize the reconstruction objective, thereby representing a single object with multiple redundant slots. We tackle this limitation with a reconstruction-guided slot curriculum (SlotCurri). Training starts with only a few coarse slots and progressively allocates new slots where reconstruction error remains high, thus expanding capacity only where it is needed and preventing fragmentation from the outset. During slot expansion, however, meaningful sub-parts can emerge only if coarse-level semantics are already well separated, and with a small initial slot budget and an MSE objective, semantic boundaries remain blurry. Therefore, we augment MSE with a structure-aware loss that preserves local contrast and edge information to encourage each slot to sharpen its semantic boundaries. Lastly, we propose cyclic inference, which rolls slots forward and then backward through the frame sequence, producing temporally consistent object representations even in the earliest frames. Taken together, SlotCurri addresses object over-fragmentation by allocating representational capacity where reconstruction fails, further enhanced by structural cues and cyclic inference. Notable FG-ARI gains of +6.8 on YouTube-VIS and +8.3 on MOVi-C validate the effectiveness of SlotCurri. Our code is available at github.com/wjun0830/SlotCurri.
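The cyclic inference described in the abstract can be sketched as a forward roll followed by a backward roll. This is a toy illustration under stated assumptions: `cyclic_inference` and `toy_update` are hypothetical names, the `update` callable stands in for one recurrent slot-attention step, and returning the backward-pass slots for every frame is an assumption rather than a detail confirmed by the abstract.

```python
def cyclic_inference(frames, init_slots, update):
    """Roll slots forward through all frames, then backward.

    update(slots, frame) is a placeholder for one recurrent slot update.
    Returns one slot state per frame, taken from the backward roll, so
    early frames are decoded with slots that have already seen the clip.
    """
    slots = init_slots
    for frame in frames:  # forward roll: frame 0 to frame T-1
        slots = update(slots, frame)
    outputs = [None] * len(frames)
    for t in range(len(frames) - 1, -1, -1):  # backward roll: T-1 to 0
        slots = update(slots, frames[t])
        outputs[t] = slots
    return outputs


def toy_update(slots, frame):
    """Toy stand-in: move each scalar slot halfway toward the frame value."""
    return [0.5 * s + 0.5 * frame for s in slots]


frames = [1.0, 1.0, 1.0]
per_frame_slots = cyclic_inference(frames, [0.0], toy_update)
forward_only_first = toy_update([0.0], frames[0])
```

With this toy update, a forward-only pass leaves the first frame with a slot value of 0.5, while the cyclic pass gives it 0.984375, because the slot has been refined on the whole sequence before it returns to frame 0. That mirrors the stated motivation of obtaining consistent representations even in the earliest frames.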

Problem

Research questions and friction points this paper is trying to address.

object over-fragmentation

video object-centric learning

slot attention

reconstruction error

semantic boundaries

Innovation

Methods, ideas, or system contributions that make the work stand out.

slot curriculum

object-centric learning

over-fragmentation