🤖 AI Summary
This work addresses the challenge of extracting object-centric structured representations from complex real-world scenes, characterized by multiple objects and low contrast, under fully unsupervised conditions. The proposed approach, ORGAN, introduces cycle consistency from Cycle-Consistent Generative Adversarial Networks (GANs) into object-centric representation learning, moving beyond the autoencoder architectures that dominate the field. By jointly performing unsupervised object segmentation and modeling a low-dimensional latent space, the method decomposes input images into independent object representations and reconstructs them faithfully. Experiments show that it matches state-of-the-art performance on synthetic data and is the only approach among those evaluated that effectively handles multi-object, low-contrast real-world images. The learned representations enable object-level manipulation and scale well with both object count and image resolution.
📝 Abstract
Although data generation is often straightforward, extracting information from data is more difficult. Object-centric representation learning can extract information from images in an unsupervised manner. It does so by segmenting an image into its subcomponents: the objects. Each object is then represented in a low-dimensional latent space that can be used for downstream processing. Object-centric representation learning is dominated by autoencoder architectures (AEs). Here, we present ORGAN, a novel approach for object-centric representation learning that is instead based on cycle-consistent Generative Adversarial Networks (GANs). We show that it performs similarly to other state-of-the-art approaches on synthetic datasets, while at the same time being the only approach tested here capable of handling more challenging real-world datasets with many objects and low visual contrast. Complementing these results, ORGAN creates expressive latent space representations that allow for object manipulation. Finally, we show that ORGAN scales well with respect to both the number of objects and the size of the images, giving it a unique edge over current state-of-the-art approaches.
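The core idea above, decomposing an image into per-object latent codes and enforcing that decoding those codes reproduces the input, can be sketched with a cycle-consistency penalty. The sketch below is a minimal, hedged illustration, not ORGAN's actual architecture: the `encode`/`decode` names, the linear per-object maps, and the dimensions are all assumptions standing in for the networks the paper would train adversarially.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the two directions of the cycle:
# image -> K per-object latents, and latents -> image. In a real
# cycle-consistent GAN these are neural networks trained with an
# adversarial loss; linear maps keep the sketch self-contained.
H, W, K, D = 8, 8, 3, 4                      # image size, object slots, latent dim
enc = rng.normal(size=(K, D, H * W)) * 0.1   # illustrative per-object encoders
dec = rng.normal(size=(K, H * W, D)) * 0.1   # illustrative per-object decoders

def encode(x):
    """Image -> K object latents (one low-dimensional code per object slot)."""
    return np.stack([enc[k] @ x.ravel() for k in range(K)])

def decode(z):
    """K object latents -> image, composed by summing per-object renderings."""
    return sum(dec[k] @ z[k] for k in range(K)).reshape(H, W)

def cycle_consistency_loss(x):
    """L1 penalty ||decode(encode(x)) - x||_1, averaged over pixels."""
    return float(np.mean(np.abs(decode(encode(x)) - x)))

x = rng.normal(size=(H, W))
z = encode(x)          # per-object latents, shape (K, D)
loss = cycle_consistency_loss(x)
```

Minimizing such a loss (alongside adversarial terms) is what ties the segmentation and the latent space together: each object slot must carry enough information for the decoder to rebuild its part of the scene.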