Vector-Quantized Vision Foundation Models for Object-Centric Learning

📅 2025-02-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the insufficient representation-disentanglement capability of Object-Centric Learning (OCL) under self-supervised reconstruction in complex-textured scenes, this paper proposes Vector-Quantized Vision Foundation Models for OCL (VQ-VFM-OCL, or VVO). Unlike conventional paradigms that treat Vision Foundation Models (VFMs)—e.g., ViT—merely as frozen feature extractors, the method leverages deep VFM features for object-level aggregation and additionally quantizes them (VQ-VAE–style discrete encoding) to strengthen the reconstruction supervision, yielding end-to-end trainable, object-disentangled representations. Crucially, the framework unifies object discovery and reconstruction objectives within a single, concise architecture. On standard benchmarks—including CLEVR and Multi-dSprites—the approach achieves significant improvements in object-discovery accuracy. Moreover, it attains state-of-the-art performance on downstream tasks such as visual prediction and reasoning, demonstrating the effectiveness of the learned disentangled object representations.

📝 Abstract
Decomposing visual scenes into objects, as humans do, facilitates modeling object relations and dynamics. Object-Centric Learning (OCL) achieves this by aggregating image or video feature maps into object-level feature vectors, known as *slots*. OCL's self-supervision via reconstructing the input from slots struggles with complex textures, thus many methods employ Vision Foundation Models (VFMs) to extract feature maps with better objectness. However, using VFMs merely as feature extractors does not fully unlock their potential. We propose Vector-Quantized VFMs for OCL (VQ-VFM-OCL, or VVO), where VFM features are extracted to facilitate object-level information aggregation and further quantized to strengthen supervision in reconstruction. Our VVO unifies OCL representatives into a concise architecture. Experiments demonstrate that VVO not only outperforms mainstream methods on object discovery tasks but also benefits downstream tasks like visual prediction and reasoning. The source code is available in the supplement.
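The quantization step the abstract describes—snapping continuous VFM features to a discrete codebook so they can serve as sharper reconstruction targets—can be illustrated with a minimal sketch. This is not the paper's implementation; the function name, shapes, and the toy codebook below are assumptions, showing only the generic VQ-VAE–style nearest-code lookup.

```python
import numpy as np

def quantize_features(features, codebook):
    """Map each feature vector to its nearest codebook entry (L2 distance).

    features: (N, D) array of continuous feature vectors (e.g., from a VFM).
    codebook: (K, D) array of code vectors.
    Returns the quantized features (N, D) and the chosen code indices (N,).
    """
    # Pairwise squared distances between every feature and every code,
    # computed via broadcasting: result has shape (N, K).
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = d2.argmin(axis=1)          # nearest code per feature
    return codebook[idx], idx        # discrete targets for reconstruction

# Toy example: 4 features, 3 codes, D = 2 (all values illustrative).
feats = np.array([[0.1, 0.0], [0.9, 1.1], [1.0, 0.1], [0.1, 1.0]])
codes = np.array([[0.0, 0.0], [1.0, 1.0], [1.0, 0.0]])
quantized, indices = quantize_features(feats, codes)
# indices → [0, 1, 2, 1]: each feature snaps to its nearest code.
```

In a VQ-VAE–style training setup, the decoder would then be supervised to reconstruct these discrete targets rather than the raw continuous features, which is the sense in which quantization "strengthens supervision" here.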
Problem

Research questions and friction points this paper is trying to address.

OCL's self-supervised reconstruction struggles to disentangle objects in complex-textured scenes.
Using VFMs merely as frozen feature extractors does not fully unlock their potential.
Existing OCL methods lack a unified, concise architecture.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vector-Quantized Vision Foundation Models
Unified Object-Centric Learning Architecture
Enhanced Object Discovery and Reconstruction