AI Summary
This work addresses the challenge of unsupervised learning of object-centric representations directly from raw, unstructured perceptual data. We propose the first end-to-end differentiable neuro-symbolic framework that integrates probabilistic logic programming (ProbLog) with neural networks. Our method requires only task-level distant supervision: no object annotations, predefined segmentation masks, or explicit object priors. By jointly optimizing a probabilistic logic reasoning layer with a visual perception module, the model autonomously discovers semantically coherent object structures. Empirically, it significantly outperforms both pure neural networks and existing neuro-symbolic approaches on challenging benchmarks involving compositional object generalization, zero-shot task transfer, and variable numbers of objects. To our knowledge, this is the first approach to enable differentiable, logic-guided learning of object-centric representations under distant supervision.
Abstract
Relational learning enables models to generalize across structured domains by reasoning over objects and their interactions. While recent advances in neurosymbolic reasoning and object-centric learning bring us closer to this goal, existing systems rely either on object-level supervision or on a predefined decomposition of the input into objects. In this work, we propose a neurosymbolic formulation for learning object-centric representations directly from raw, unstructured perceptual data using only distant supervision. We instantiate this approach in DeepObjectLog, a neurosymbolic model that integrates a perceptual module, which extracts relevant object representations, with a symbolic reasoning layer based on probabilistic logic programming. By enabling sound probabilistic logical inference, the symbolic component introduces a novel learning signal that further guides the discovery of meaningful objects in the input. We evaluate our model across a diverse range of generalization settings, including unseen object compositions, unseen tasks, and unseen numbers of objects. Experimental results show that our method outperforms neural and neurosymbolic baselines across all tested settings.
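The distant-supervision signal described above can be illustrated with a toy version of the classic neurosymbolic digit-addition setup: a perception module outputs a distribution over symbols for each detected object, and a probabilistic logic layer marginalizes over all symbol assignments consistent with the task-level label. The likelihood of that label is then a smooth function of the perception outputs, so gradients can flow back into perception. This is a minimal stdlib-Python sketch of the principle, not the paper's implementation; the two-digit addition task and all function names are illustrative assumptions.

```python
from itertools import product

def task_probability(object_dists, target_sum):
    """P(latent digits sum to target_sum), marginalizing over all joint
    symbol assignments. This is the distant-supervision likelihood that a
    probabilistic logic layer computes (illustrative sketch, not the
    paper's ProbLog-based implementation)."""
    total = 0.0
    for assignment in product(*[range(len(d)) for d in object_dists]):
        if sum(assignment) == target_sum:
            p = 1.0
            for dist, symbol in zip(object_dists, assignment):
                p *= dist[symbol]  # independent per-object predictions
            total += p
    return total

# Two "perceived" objects, each a distribution over digits 0-2
# (in the real model these would come from a neural perception module).
d1 = [0.7, 0.2, 0.1]
d2 = [0.1, 0.6, 0.3]

# Likelihood of the task-level label "the digits sum to 2":
# (0,2) + (1,1) + (2,0) = 0.7*0.3 + 0.2*0.6 + 0.1*0.1 = 0.34
print(round(task_probability([d1, d2], 2), 2))  # prints 0.34
```

Because the only supervision is the scalar label (here, the sum), maximizing this likelihood pressures the perception module to carve the input into symbols that make the logical rule come out right, which is the object-discovery signal the abstract refers to.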