AI Summary
Existing unsupervised video object-centric learning (OCL) methods suffer from two key limitations: (i) they neglect next-frame features, the most informative source for query prediction; and (ii) they fail to explicitly model the transition dynamics underlying object state changes. To address these, we propose RandSF.Q, the first method to introduce random slot-feature pairs into attention-based query prediction. By randomly sampling training transitions from the available recurrences, RandSF.Q explicitly learns inter-frame state dynamics while incorporating next-frame features to enhance query representation. Our approach employs an attention-driven slot-update architecture that jointly leverages random sampling and feature aggregation, enabling precise object segmentation and dynamics modeling under fully unsupervised conditions. On standard object discovery benchmarks, RandSF.Q significantly outperforms prior state-of-the-art methods, achieving up to a 10-percentage-point improvement and setting a new state of the art.
Abstract
Unsupervised video Object-Centric Learning (OCL) is promising as it enables object-level scene representation and dynamics modeling, as we humans do. Mainstream video OCL methods adopt a recurrent architecture: an aggregator aggregates the current video frame into object features, termed slots, under some queries; a transitioner transits the current slots into queries for the next frame. This architecture is effective, but all existing implementations both (i1) neglect to incorporate next-frame features, the most informative source for query prediction, and (i2) fail to learn transition dynamics, the knowledge essential for query prediction. To address these issues, we propose Random Slot-Feature pair for learning Query prediction (RandSF.Q): (t1) we design a new transitioner that incorporates both slots and features, which provides more information for query prediction; (t2) we train the transitioner to predict queries from slot-feature pairs randomly sampled from the available recurrences, which drives it to learn transition dynamics. Experiments on scene representation demonstrate that our method surpasses existing video OCL methods significantly, e.g., by up to 10 points on object discovery, setting a new state-of-the-art. This superiority also benefits downstream tasks like dynamics modeling. Our core source code and training logs are available in the supplement.
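To make the recurrence concrete, here is a minimal NumPy sketch of the aggregator/transitioner loop and of the random slot-feature pair sampling described above. All internals here are illustrative assumptions, not the paper's implementation: `aggregator` is a single slot-attention-style step, `transitioner` is a hypothetical residual cross-attention read-out from next-frame features, and the dimensions are toy values; the actual method uses learned modules.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N_SLOTS, N_TOKENS, T = 8, 4, 16, 5  # toy sizes (assumed, not from the paper)

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def aggregator(features, queries):
    """One slot-attention-style step: queries compete for the frame's features."""
    attn = softmax(queries @ features.T, axis=0)          # normalize over slots
    attn = attn / (attn.sum(axis=1, keepdims=True) + 1e-8)
    return attn @ features                                # updated slots

def transitioner(slots, next_features):
    """Predict next-frame queries from BOTH current slots and next-frame features
    (hypothetical residual cross-attention; the paper's module differs)."""
    read = softmax(slots @ next_features.T, axis=1) @ next_features
    return slots + read

# Recurrent rollout over a toy "video", storing every (slots, next-frame features)
# pair produced along the way.
frames = rng.normal(size=(T, N_TOKENS, D))
queries = rng.normal(size=(N_SLOTS, D))
pairs = []
for t in range(T - 1):
    slots = aggregator(frames[t], queries)
    pairs.append((slots, frames[t + 1]))
    queries = transitioner(slots, frames[t + 1])

# Training-time random sampling: draw a slot-feature pair from ANY available
# recurrence, not just the latest, so the transitioner must learn transition
# dynamics that generalize across time steps.
s, f = pairs[rng.integers(len(pairs))]
predicted_queries = transitioner(s, f)
```

In a real training loop, `predicted_queries` would be supervised (e.g., against the queries that best reconstruct the next frame) and gradients would update the learned aggregator and transitioner; the sketch only shows the data flow.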