🤖 AI Summary
Existing vision-language models struggle to jointly model conceptual hierarchies (e.g., dog ≼ mammal ≼ animal) and compositional semantics (e.g., "dog in car" ≼ dog ∧ car). While hyperbolic spaces excel at representing tree-like hierarchies, they lack expressivity for logical composition across distinct concept families. Method: We propose an ℓ₁-product hyperbolic space, a geometric framework that couples multiple hyperbolic factor spaces via an ℓ₁-norm metric, yielding a Boolean-algebra-like structure for vision-language representation learning. The approach integrates these product-hyperbolic embeddings with contrastive vision-language pretraining. Contribution/Results: The model achieves state-of-the-art performance on zero-shot classification, retrieval, hierarchical classification, and compositional reasoning, significantly outperforming single-space baselines. It offers enhanced generalization, explicit structural interpretability, and a principled decoupling of hierarchy and composition within a unified geometric representation.
📝 Abstract
Vision-language models have achieved remarkable success in multi-modal representation learning from large-scale pairs of visual scenes and linguistic descriptions. However, they still struggle to simultaneously express two distinct types of semantic structures: the hierarchy within a concept family (e.g., dog $\preceq$ mammal $\preceq$ animal) and the compositionality across different concept families (e.g., "a dog in a car" $\preceq$ dog, car). Recent works have addressed this challenge by employing hyperbolic space, which efficiently captures tree-like hierarchy, yet its suitability for representing compositionality remains unclear. To resolve this dilemma, we propose PHyCLIP, which employs an $\ell_1$-Product metric on a Cartesian product of Hyperbolic factors. With our design, intra-family hierarchies emerge within individual hyperbolic factors, and cross-family composition is captured by the $\ell_1$-product metric, analogous to a Boolean algebra. Experiments on zero-shot classification, retrieval, hierarchical classification, and compositional understanding tasks demonstrate that PHyCLIP outperforms existing single-space approaches and offers more interpretable structures in the embedding space.
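To make the metric concrete, here is a minimal sketch of an ℓ₁-product distance over hyperbolic factors. It assumes each factor is modeled as a Poincaré ball (the abstract does not specify which hyperbolic model PHyCLIP uses, so this choice, the function names, and the toy coordinates are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def poincare_distance(u, v, eps=1e-9):
    """Geodesic distance between two points inside one Poincare-ball factor
    (an illustrative choice of hyperbolic model)."""
    uu = np.dot(u, u)
    vv = np.dot(v, v)
    duv = np.dot(u - v, u - v)
    # Standard closed-form Poincare-ball distance; eps guards the boundary.
    arg = 1.0 + 2.0 * duv / max((1.0 - uu) * (1.0 - vv), eps)
    return np.arccosh(arg)

def l1_product_distance(x, y):
    """l1-product metric: the sum of per-factor hyperbolic distances.
    x and y are lists of coordinates, one array per concept-family factor."""
    return sum(poincare_distance(xi, yi) for xi, yi in zip(x, y))

# Toy example with two factors (hypothetically: an "animal" family
# and a "vehicle" family), each embedded in its own 2-D ball.
x = [np.array([0.1, 0.0]), np.array([0.0, 0.2])]
y = [np.array([0.3, 0.0]), np.array([0.0, 0.2])]
print(l1_product_distance(x, y))
```

Because the factor distances are summed rather than mixed, a change within one concept family moves the total distance independently of the others, which is the decoupling of intra-family hierarchy from cross-family composition that the abstract describes.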