🤖 AI Summary
This work addresses the computational cost and redundancy of traditional video object-centric learning, which relies on learned dynamics modules that predict future object representations (slots) to maintain temporal consistency. The authors propose Grounded Correspondence, a framework that removes learned temporal modeling and instead applies the Hungarian algorithm directly to align slot representations across frames. A frozen, self-supervised vision backbone supplies instance-discriminative features, and identity consistency follows from deterministic bipartite matching on the resulting slots. Evaluated on the MOVi-D, MOVi-E, and YouTube-VIS benchmarks, the approach attains competitive performance with zero learnable parameters for temporal modeling, which simplifies the architecture and avoids the cost of a learned predictor.
📝 Abstract
The de facto approach in video object-centric learning maintains temporal consistency through learned dynamics modules that predict future object representations, called slots. We demonstrate that these predictors function as expensive approximations of discrete correspondence problems. Modern self-supervised vision backbones already encode instance-discriminative features that distinguish objects reliably. Exploiting these features eliminates the need for learned temporal prediction. We introduce Grounded Correspondence, a framework that replaces learned transition functions with deterministic bipartite matching. Slots are initialized from salient regions in frozen backbone features. Frame-to-frame identity is maintained through Hungarian matching on slot representations. The approach requires zero learnable parameters for temporal modeling yet achieves competitive performance on MOVi-D, MOVi-E, and YouTube-VIS. Project page: https://magenta-sherbet-85b101.netlify.app/
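To make the matching step concrete, the following is a minimal sketch of frame-to-frame slot alignment by Hungarian matching. It is not the authors' implementation. It assumes that both frames have the same number of slots, that slots are plain feature vectors, and that cosine distance is the matching cost; the abstract specifies none of these details. The names `cosine_distance`, `hungarian`, and `align_slots` are hypothetical.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity between two feature vectors (assumed cost)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b + 1e-8)


def hungarian(cost):
    """Minimum-cost assignment for a square cost matrix (O(n^3)).

    Returns a list where entry i is the column assigned to row i.
    """
    n = len(cost)
    inf = float("inf")
    u = [0.0] * (n + 1)   # row potentials (1-based)
    v = [0.0] * (n + 1)   # column potentials (1-based)
    p = [0] * (n + 1)     # p[j] = row matched to column j, 0 if unmatched
    way = [0] * (n + 1)   # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (n + 1)
        used = [False] * (n + 1)
        while True:
            used[j0] = True
            i0 = p[j0]
            delta = inf
            j1 = 0
            for j in range(1, n + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j] = cur
                        way[j] = j0
                    if minv[j] < delta:
                        delta = minv[j]
                        j1 = j
            # Shift potentials so one more column becomes reachable.
            for j in range(n + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        # Flip the matching along the augmenting path back to the root.
        while j0 != 0:
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    assignment = [0] * n
    for j in range(1, n + 1):
        assignment[p[j] - 1] = j - 1
    return assignment


def align_slots(prev_slots, curr_slots):
    """Reorder curr_slots so that index k keeps the identity of prev_slots[k]."""
    cost = [[cosine_distance(p, c) for c in curr_slots] for p in prev_slots]
    perm = hungarian(cost)
    return [curr_slots[j] for j in perm], perm


# Toy example: the current frame holds the same three objects in shuffled order.
prev = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
curr = [[0.1, 0.9, 0.0], [0.0, 0.1, 0.9], [0.9, 0.0, 0.1]]
aligned, perm = align_slots(prev, curr)
print(perm)
```

The step is deterministic and has no trainable parameters, which is the property the abstract emphasizes. Applying `align_slots` to each consecutive pair of frames, with the aligned output of one step serving as `prev_slots` for the next, would propagate slot identities through a clip under the assumptions above.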