AI Summary
Existing concept bottleneck models (CBMs) based on global image encodings suffer from limited expressivity, hindering their application to complex visual reasoning tasks such as multi-object recognition, multi-label classification, and structured reasoning. To address this, we propose Object-Centric Concept Bottlenecks (OCB), a CBM framework grounded in the object-centric paradigm. OCB integrates pretrained foundation models (e.g., CLIP or SAM) with object detection or segmentation modules to construct an object-level concept encoder and a learnable aggregation mechanism. This enables fine-grained, interpretable concept activation and linear decision-making at the object level. By decoupling concept learning from global image representations, OCB improves accuracy and concept-level verifiability on multi-object classification and attribute reasoning tasks, while preserving strong representational capacity and model transparency.
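The pipeline described above (per-object concept encoding, aggregation into an image-level concept vector, then a linear decision layer) can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the object features are assumed to come from a pretrained detector/encoder, the linear concept probe `W_c`, classifier weights `W_cls`, and all dimensions are hypothetical placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def concept_encoder(obj_feat, W_c):
    """Map one object's feature vector to concept activations in [0, 1]
    via a hypothetical linear probe followed by a sigmoid."""
    z = obj_feat @ W_c
    return 1.0 / (1.0 + np.exp(-z))

def ocb_forward(object_feats, W_c, W_cls, agg="max"):
    # Per-object concept activations: shape (n_objects, n_concepts)
    C = np.stack([concept_encoder(f, W_c) for f in object_feats])
    # Aggregate object-level concepts into one image-level vector
    # (max- vs mean-pooling is one of the aggregation choices ablated in the paper)
    g = C.max(axis=0) if agg == "max" else C.mean(axis=0)
    # Interpretable linear decision layer on the aggregated concepts
    logits = g @ W_cls
    return logits, C, g

# Toy example: 3 detected objects with 8-dim features, 5 concepts, 4 classes
feats = rng.normal(size=(3, 8))
W_c = rng.normal(size=(8, 5))
W_cls = rng.normal(size=(5, 4))
logits, C, g = ocb_forward(feats, W_c, W_cls)
print(logits.shape, C.shape)  # per-image logits and per-object concept matrix
```

Because the decision layer is linear in `g`, each class logit decomposes into per-concept contributions `g[k] * W_cls[k, c]`, which is what makes the decision inspectable at the concept level.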
Abstract
Developing high-performing, yet interpretable models remains a critical challenge in modern AI. Concept-based models (CBMs) attempt to address this by extracting human-understandable concepts from a global encoding (e.g., image encoding) and then applying a linear classifier on the resulting concept activations, enabling transparent decision-making. However, their reliance on holistic image encodings limits their expressiveness in object-centric real-world settings and thus hinders their ability to solve complex vision tasks beyond single-label classification. To tackle these challenges, we introduce Object-Centric Concept Bottlenecks (OCB), a framework that combines the strengths of CBMs and pre-trained object-centric foundation models, boosting performance and interpretability. We evaluate OCB on complex image datasets and conduct a comprehensive ablation study to analyze key components of the framework, such as strategies for aggregating object-concept encodings. The results show that OCB outperforms traditional CBMs and allows one to make interpretable decisions for complex visual tasks.