🤖 AI Summary
This work investigates whether human-like semantic compositionality—i.e., interpretable decomposition and recomposition of image representations along semantic parts—exists in the visual embedding space of vision-language models (VLMs). Conventional linear compositional analysis fails on visual embeddings due to the high noise and sparsity of image data. To address this, we propose Geodesically Decomposable Embeddings (GDE), a geometry-aware framework that replaces linear assumptions with geodesic structure on the embedding manifold to model nonlinear semantic composition. We provide the first systematic empirical validation that mainstream VLMs exhibit significant, interpretable compositionality in their visual embeddings. GDE outperforms linear baselines on compositional classification and surpasses specialized methods in group robustness. Moreover, it reveals that VLMs implicitly acquire a form of automatic compositional reasoning. Our findings establish a new paradigm for interpretable and structured visual understanding in VLMs.
📝 Abstract
Vision-Language Models (VLMs) learn a shared feature space for text and images, enabling the comparison of inputs of different modalities. While prior works demonstrated that VLMs organize natural language representations into regular structures encoding composite meanings, it remains unclear whether compositional patterns also emerge in the visual embedding space. In this work, we investigate compositionality in the image domain, where the analysis of compositional properties is challenged by the noise and sparsity of visual data. We address these problems and propose a framework, called Geodesically Decomposable Embeddings (GDE), that approximates image representations with geometry-aware compositional structures in the latent space. We demonstrate that visual embeddings of pre-trained VLMs exhibit a compositional arrangement, and evaluate the effectiveness of this property on the tasks of compositional classification and group robustness. GDE achieves stronger performance in compositional classification than its counterpart method that assumes a linear geometry of the latent space. Notably, it is particularly effective for group robustness, where it outperforms task-specific solutions. Our results indicate that VLMs can automatically develop a human-like form of compositional reasoning in the visual domain, making their underlying processes more interpretable. Code is available at https://github.com/BerasiDavide/vlm_image_compositionality.
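To make the distinction between linear and geodesic composition concrete, here is a minimal, self-contained sketch. It is not the paper's implementation: it merely contrasts additive composition of embedding vectors with composition along the geodesic of the unit sphere (spherical linear interpolation), one simple example of a geometry-aware alternative to linear structure. All names and the toy random embeddings are hypothetical.

```python
import numpy as np

def slerp(p, q, t):
    """Point at fraction t along the unit-sphere geodesic from p to q.

    p and q must be unit-norm vectors; the result is also unit-norm.
    """
    omega = np.arccos(np.clip(np.dot(p, q), -1.0, 1.0))  # angle between p and q
    if np.isclose(omega, 0.0):
        return p  # p and q coincide; the geodesic degenerates to a point
    return (np.sin((1.0 - t) * omega) * p + np.sin(t * omega) * q) / np.sin(omega)

def unit(v):
    """Project a vector onto the unit sphere."""
    return v / np.linalg.norm(v)

# Toy stand-ins for an attribute embedding and an object embedding
# (hypothetical random vectors, not real VLM features).
rng = np.random.default_rng(0)
e_attr = unit(rng.normal(size=8))   # e.g. "red"
e_obj = unit(rng.normal(size=8))    # e.g. "car"

# Linear composition: add the vectors, then renormalize.
linear_comp = unit(e_attr + e_obj)

# Geodesic composition: take the midpoint of the sphere geodesic instead.
geodesic_comp = slerp(e_attr, e_obj, 0.5)
```

Both composites lie on the unit sphere, but the geodesic version is built from the manifold's own distance structure rather than from ambient vector addition, which is the kind of distinction GDE exploits at scale.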