Evaluating Object-Centric Models beyond Object Discovery

πŸ“… 2026-02-07
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Existing evaluations of object-centric models are largely confined to object discovery and simple reasoning tasks, which inadequately assess their representational capacity under compositional generalization and out-of-distribution shifts. To address this limitation, this work proposes an evaluation framework that uses instruction-tuned vision-language models (VLMs) to probe how well object-centric representations support complex reasoning across diverse visual question answering (VQA) tasks. The framework introduces a unified task design that evaluates localization accuracy and representational effectiveness jointly. By employing VLMs as scalable evaluators and including a multi-feature reconstruction baseline as a reference point, the approach overcomes the fragmentation of conventional metrics, enabling a more comprehensive, consistent, and scalable assessment of object-centric models' representational capabilities in complex scenarios.

πŸ“ Abstract
Object-centric learning (OCL) aims to learn structured scene representations that support compositional generalization and robustness to out-of-distribution (OOD) data. However, OCL models are often not evaluated with respect to these goals. Instead, most prior work evaluates OCL models solely through object discovery and simple reasoning tasks, such as probing the representation via image classification. We identify two limitations in existing benchmarks: (1) they provide limited insight into the representation usefulness of OCL models, and (2) localization and representation usefulness are assessed using disjoint metrics. To address (1), we use instruction-tuned VLMs as evaluators, enabling scalable benchmarking across diverse VQA datasets to measure how well VLMs leverage OCL representations for complex reasoning tasks. To address (2), we introduce a unified evaluation task and metric that jointly assess localization (where) and representation usefulness (what), thereby eliminating inconsistencies introduced by disjoint evaluation. Finally, we include a simple multi-feature reconstruction baseline as a reference point.
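To make the "unified where + what" idea concrete, here is a minimal sketch of one way such a joint metric could be computed. This is an illustrative assumption, not the paper's actual metric: it greedily matches each ground-truth object mask to a predicted slot mask by IoU, then gates the localization score by whether a downstream reasoning query about that object was answered correctly. The function names (`iou`, `unified_score`) and the toy masks are hypothetical.

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boolean masks."""
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 0.0

def unified_score(slot_masks, gt_masks, qa_correct):
    """Hypothetical joint metric: localization (where) gated by
    reasoning correctness (what), averaged over ground-truth objects."""
    used, scores = set(), []
    for gt, correct in zip(gt_masks, qa_correct):
        # Greedily match this GT object to its best unused slot by IoU.
        best_iou, best_j = 0.0, None
        for j, sm in enumerate(slot_masks):
            if j in used:
                continue
            v = iou(gt, sm)
            if v > best_iou:
                best_iou, best_j = v, j
        if best_j is not None:
            used.add(best_j)
        # Joint score: a well-localized object counts only if the
        # VQA question about it was answered correctly.
        scores.append(best_iou * float(correct))
    return float(np.mean(scores)) if scores else 0.0

# Toy example: two objects on a 4x4 grid, perfectly localized slots.
g1 = np.zeros((4, 4), bool); g1[:2, :2] = True
g2 = np.zeros((4, 4), bool); g2[2:, 2:] = True
slots = [g1.copy(), g2.copy()]
print(unified_score(slots, [g1, g2], [True, False]))  # 0.5
```

Coupling the two axes in one number is what removes the inconsistency the abstract points to: a model can no longer score well by localizing objects whose representations are useless for reasoning, or vice versa.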
Problem

Research questions and friction points this paper is trying to address.

object-centric learning
compositional generalization
out-of-distribution robustness
evaluation benchmark
representation usefulness
Innovation

Methods, ideas, or system contributions that make the work stand out.

object-centric learning
vision-language models
unified evaluation
compositional generalization
out-of-distribution robustness