🤖 AI Summary
Current intelligent systems struggle to extract compositional, interpretable abstract concepts from visual scenes and to ground them semantically. To address this, we propose the Neural Slot Interpreter (NSI), which builds object-centric slot representations organized by an XML-like structured schema that encodes object semantics, and introduces slot-level cross-modal contrastive learning for interpretable, disentangled grounding of semantics to spatial entities. Key contributions: (1) the first slot-level semantic grounding framework, which decouples representations from the pixel grid and explicitly binds semantics to objects; (2) a slot-aware ViT tokenizer and a structured slot encoder. Experiments demonstrate strong performance: as few as ten slot tokens outperform standard patch tokens in few-shot classification; grounding accuracy and retrieval interpretability significantly surpass bounding-box-based methods; object discovery F1 improves by 23%; and annotation data efficiency increases by 2.1×.
📝 Abstract
Several accounts of human cognition posit that our intelligence is rooted in our ability to form abstract composable concepts, ground them in our environment, and reason over these grounded entities. This trifecta of human thought has remained elusive in modern intelligent machines. In this work, we investigate whether slot representations extracted from visual scenes serve as appropriate compositional abstractions for grounding and reasoning. We present the Neural Slot Interpreter (NSI), which learns to ground object semantics in slots. At the core of NSI is an XML-like schema that uses simple syntax rules to organize the object semantics of a scene into object-centric schema primitives. NSI then learns to ground primitives into slots through a structured contrastive learning objective that reasons over intermodal alignment. Experiments with a bi-modal object-property and scene retrieval task demonstrate the grounding efficacy and interpretability of the correspondences learned by NSI. From a scene representation standpoint, we find that emergent NSI slots, which move beyond the image grid by binding to spatial objects, facilitate improved visual grounding compared to conventional bounding-box-based approaches. From a data efficiency standpoint, we empirically validate that NSI learns more generalizable representations from a fixed amount of annotation data than the traditional approach. We also show that the grounded slots surpass unsupervised slots in real-world object discovery and scale with scene complexity. Finally, we investigate the reasoning abilities of the grounded slots. Vision Transformers trained on grounding-aware NSI tokenizers outperform patch-based tokens on challenging few-shot classification tasks using as few as ten tokens.
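To make the two core ideas above concrete, the sketch below illustrates (a) an XML-like schema whose per-object primitives encode object semantics and (b) a slot-level contrastive alignment objective. All names, schema fields, and the specific InfoNCE-style loss are illustrative assumptions, not the paper's exact schema or objective.

```python
import xml.etree.ElementTree as ET
import numpy as np

# Hypothetical XML-like schema for one scene: each <object> element is an
# object-centric primitive (field names here are invented for illustration).
SCENE_SCHEMA = """
<scene>
  <object><shape>cube</shape><color>red</color><material>metal</material></object>
  <object><shape>sphere</shape><color>blue</color><material>rubber</material></object>
</scene>
"""

def parse_primitives(schema_xml):
    """Parse the schema into a list of per-object attribute dicts."""
    root = ET.fromstring(schema_xml)
    return [{child.tag: child.text for child in obj} for obj in root]

def slot_level_contrastive_loss(slots, primitive_embs, temperature=0.1):
    """Toy slot-primitive alignment: InfoNCE over the pairwise cosine
    similarity matrix, assuming slot i is paired with primitive i."""
    s = slots / np.linalg.norm(slots, axis=1, keepdims=True)
    p = primitive_embs / np.linalg.norm(primitive_embs, axis=1, keepdims=True)
    logits = (s @ p.T) / temperature              # (num_slots, num_primitives)
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))           # pull slot i toward primitive i

primitives = parse_primitives(SCENE_SCHEMA)       # 2 object primitives
rng = np.random.default_rng(0)
slot_vecs = rng.normal(size=(2, 8))               # stand-in slot embeddings
prim_vecs = rng.normal(size=(2, 8))               # stand-in primitive embeddings
loss = slot_level_contrastive_loss(slot_vecs, prim_vecs)
```

Because the loss is computed per slot-primitive pair rather than per image-caption pair, minimizing it encourages each slot to bind to one schema primitive, which is what makes the learned correspondences inspectable object by object.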