🤖 AI Summary
This work investigates whether human-like semantic compositionality—i.e., interpretable decomposition and recomposition of image representations along semantic parts—exists in the visual embedding space of vision-language models (VLMs). Conventional linear compositional analysis fails on visual embeddings due to the high noise and sparsity of image data. To address this, we propose Geodesically Decomposable Embeddings (GDE), a geometry-aware framework that replaces linear assumptions with geodesic structure on the embedding manifold to model nonlinear semantic composition. We provide the first systematic empirical validation that mainstream VLMs exhibit significant, interpretable compositionality in their visual embeddings. GDE outperforms linear baselines on compositional classification and surpasses specialized methods in group robustness. Moreover, it reveals that VLMs implicitly acquire a form of automatic compositional reasoning. Our findings establish a new paradigm for interpretable and structured visual understanding in VLMs.
📝 Abstract
Vision-Language Models (VLMs) learn a shared feature space for text and images, enabling the comparison of inputs of different modalities. While prior works demonstrated that VLMs organize natural language representations into regular structures encoding composite meanings, it remains unclear whether compositional patterns also emerge in the visual embedding space. In this work, we investigate compositionality in the image domain, where the analysis of compositional properties is challenged by the noise and sparsity of visual data. We address these problems and propose a framework, called Geodesically Decomposable Embeddings (GDE), that approximates image representations with geometry-aware compositional structures in the latent space. We demonstrate that visual embeddings of pre-trained VLMs exhibit a compositional arrangement, and evaluate the effectiveness of this property on the tasks of compositional classification and group robustness. GDE achieves stronger performance in compositional classification than its counterpart method that assumes a linear geometry of the latent space. Notably, it is particularly effective for group robustness, where it outperforms task-specific solutions. Our results indicate that VLMs can automatically develop a human-like form of compositional reasoning in the visual domain, making their underlying processes more interpretable. Code is available at https://github.com/BerasiDavide/vlm_image_compositionality.
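To make the distinction between linear and geodesic composition concrete, here is a minimal, self-contained sketch. It is not the paper's implementation: it merely contrasts additive composition of embedding vectors with composition along the geodesic of the unit sphere (spherical linear interpolation), one simple example of a geometry-aware alternative to linear structure. All names and the toy random embeddings are hypothetical.

```python
import numpy as np

def slerp(p, q, t):
    """Point at fraction t along the unit-sphere geodesic from p to q.

    p and q must be unit-norm vectors; the result is also unit-norm.
    """
    omega = np.arccos(np.clip(np.dot(p, q), -1.0, 1.0))  # angle between p and q
    if np.isclose(omega, 0.0):
        return p  # p and q coincide; the geodesic degenerates to a point
    return (np.sin((1.0 - t) * omega) * p + np.sin(t * omega) * q) / np.sin(omega)

def unit(v):
    """Project a vector onto the unit sphere."""
    return v / np.linalg.norm(v)

# Toy stand-ins for an attribute embedding and an object embedding
# (hypothetical random vectors, not real VLM features).
rng = np.random.default_rng(0)
e_attr = unit(rng.normal(size=8))   # e.g. "red"
e_obj = unit(rng.normal(size=8))    # e.g. "car"

# Linear composition: add the vectors, then renormalize.
linear_comp = unit(e_attr + e_obj)

# Geodesic composition: take the midpoint of the sphere geodesic instead.
geodesic_comp = slerp(e_attr, e_obj, 0.5)
```

Both composites lie on the unit sphere, but the geodesic version is built from the manifold's own distance structure rather than from ambient vector addition, which is the kind of distinction GDE exploits at scale.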