AI Summary
Existing evaluations of object-centric learning (OCL) models are largely confined to object discovery and simple reasoning tasks, which inadequately assess their representations under compositional generalization and out-of-distribution shifts. To address this limitation, this work proposes an evaluation framework that leverages instruction-tuned vision-language models (VLMs) to probe how well object-centric representations support complex reasoning across diverse visual question answering (VQA) tasks. The framework introduces a unified task design that jointly evaluates localization accuracy and representational usefulness. By employing VLMs as scalable evaluators and including a simple multi-feature reconstruction baseline as a reference point, the approach avoids the fragmentation of conventional disjoint metrics, enabling a more comprehensive, consistent, and scalable assessment of object-centric models' representational capabilities in complex scenarios.
Abstract
Object-centric learning (OCL) aims to learn structured scene representations that support compositional generalization and robustness to out-of-distribution (OOD) data. However, OCL models are often not evaluated with respect to these goals. Instead, most prior work evaluates OCL models solely through object discovery and simple reasoning tasks, such as probing the representations via image classification. We identify two limitations in existing benchmarks: (1) they provide limited insight into the usefulness of OCL representations, and (2) localization and representation usefulness are assessed with disjoint metrics. To address (1), we use instruction-tuned vision-language models (VLMs) as evaluators, enabling scalable benchmarking across diverse VQA datasets to measure how well VLMs can leverage OCL representations for complex reasoning tasks. To address (2), we introduce a unified evaluation task and metric that jointly assess localization (where) and representation usefulness (what), thereby eliminating inconsistencies introduced by disjoint evaluation. Finally, we include a simple multi-feature reconstruction baseline as a reference point.
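To make the unified "where + what" idea concrete, below is a minimal sketch in PyTorch of one way such a joint score could be computed: a ground-truth object counts toward the score only if some predicted slot mask localizes it and the VLM answers the question grounded in that object correctly. The function names, the IoU threshold, and the AND-combination rule are all illustrative assumptions, not the paper's actual metric or implementation.

```python
# Illustrative sketch only: a hypothetical joint "where + what" score.
# The IoU threshold, the AND-combination rule, and all inputs are
# assumptions for exposition, not the paper's actual metric.
import torch

def mask_iou(pred: torch.Tensor, gt: torch.Tensor) -> float:
    """IoU between two binary (bool) masks of the same shape."""
    inter = (pred & gt).sum().float()
    union = (pred | gt).sum().float()
    return (inter / union.clamp(min=1)).item()

def joint_where_what_score(pred_masks, gt_masks, vqa_correct, iou_thresh=0.5):
    """Each ground-truth object contributes 1 only if it is localized by
    some slot mask (the "where" part) AND the VLM answered the question
    grounded in that object correctly (the "what" part)."""
    hits = 0
    for gt, correct in zip(gt_masks, vqa_correct):
        localized = max(mask_iou(p, gt) for p in pred_masks) >= iou_thresh
        hits += int(localized and correct)
    return hits / len(gt_masks)

# Toy usage: in practice, pred_masks would come from an OCL model's slot
# decoder, and vqa_correct from a VLM answering object-grounded questions.
H = W = 16
gt = [torch.zeros(H, W, dtype=torch.bool) for _ in range(2)]
gt[0][:8, :8] = True
gt[1][8:, 8:] = True
pred = [gt[0].clone(), torch.zeros(H, W, dtype=torch.bool)]  # finds object 0 only
print(joint_where_what_score(pred, gt, vqa_correct=[True, True]))  # 0.5
```

Under this toy combination rule, a model that localizes an object poorly gets no credit for it even when the VLM answers correctly, which is one simple way a single metric could couple localization and representation usefulness rather than reporting them disjointly.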