🤖 AI Summary
This work addresses two key limitations in multi-object instance retrieval: (i) self-supervised visual representations (e.g., DINO) lack fine-grained object-level discriminability (e.g., over attributes such as color), while (ii) slot-based methods suffer from insufficient global semantic understanding. We propose a lightweight, plug-and-play fusion framework that requires neither fine-tuning nor retraining of the pretrained backbone. Our core innovation is the first multi-scale alignment and fusion of DINO-extracted scene-level global features with object-centric latent vectors produced by a VAE trained on segmented image patches—thereby bridging scene-level and instance-level representations. Crucially, the method preserves the generalization capability of pretrained models and introduces only a small learnable fusion module. On multiple fine-grained multi-object retrieval benchmarks, our approach significantly outperforms both vanilla DINO and state-of-the-art slot-based baselines, empirically validating the effectiveness and practicality of global–local collaborative modeling.
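The fusion idea described above can be sketched in a few lines. This is a minimal, illustrative sketch only: it assumes a scene-level DINO vector and per-object VAE latents are already computed, and the dimensions, pooling choice, and the single linear fusion layer are assumptions for illustration, not the paper's actual architecture.

```python
import random

def fuse(global_feat, object_latents, W, b):
    """Combine a frozen scene-level feature with frozen object-centric
    latents via a small learnable linear map (the only trained part).

    global_feat    : list[float], scene-level (e.g., DINO) embedding
    object_latents : list[list[float]], one latent per segmented object
    W, b           : weights/bias of the lightweight fusion module
    """
    # Mean-pool the per-object latents into one local summary vector.
    pooled = [sum(dim_vals) / len(object_latents)
              for dim_vals in zip(*object_latents)]
    # Concatenate global and pooled local features.
    x = global_feat + pooled
    # y = W x + b — the small trainable fusion head.
    return [sum(w * xi for w, xi in zip(row, x)) + bi
            for row, bi in zip(W, b)]

# Toy example: 4-d global feature, two 2-d object latents,
# fused into a 3-d retrieval embedding.
random.seed(0)
g = [0.1, 0.2, 0.3, 0.4]
z = [[1.0, 0.0], [0.0, 1.0]]
W = [[random.uniform(-1, 1) for _ in range(6)] for _ in range(3)]
b = [0.0, 0.0, 0.0]
embedding = fuse(g, z, W, b)
print(len(embedding))  # 3
```

Because only `W` and `b` are trained, the pretrained encoders stay frozen, which is what keeps the module plug-and-play.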
📝 Abstract
Object-centric learning is fundamental to human vision and crucial for models requiring complex reasoning. Traditional approaches rely on slot-based bottlenecks to learn object properties explicitly, while recent self-supervised vision models like DINO have shown emergent object understanding. However, DINO representations primarily capture global scene features, often confounding individual object attributes. We investigate the effectiveness of DINO representations and slot-based methods for multi-object instance retrieval. Our findings reveal that DINO representations excel at capturing global object attributes such as shape and size, but struggle with object-level details like colour, whereas slot-based representations struggle with both global and object-level understanding. To address this, we propose a method that combines global and local features by augmenting DINO representations with object-centric latent vectors from a Variational Autoencoder (VAE) trained on segmented image patches extracted from the DINO features. This approach improves multi-object instance retrieval performance, bridging the gap between global scene understanding and fine-grained object representation without requiring full model retraining.
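Once images are mapped to fused embeddings, instance retrieval itself reduces to nearest-neighbour search. A minimal sketch, assuming cosine similarity as the ranking metric (a common choice for embedding retrieval; the paper's exact metric is not specified here) and a small in-memory gallery:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve(query, gallery, k=2):
    """Return indices of the k gallery embeddings most similar
    to the query, ranked by cosine similarity (descending)."""
    ranked = sorted(range(len(gallery)),
                    key=lambda i: cosine(query, gallery[i]),
                    reverse=True)
    return ranked[:k]

# Toy 3-d fused embeddings; values are illustrative only.
query = [1.0, 0.0, 0.5]
gallery = [[0.9, 0.1, 0.4],   # similar instance
           [0.0, 1.0, 0.0],   # unrelated object
           [1.0, 0.0, 0.6]]   # same instance, slight variation
print(retrieve(query, gallery))  # [2, 0]
```

Benchmarking then amounts to checking whether the true matching instances appear in the top-k ranked results for each query.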