Compositional Entailment Learning for Hyperbolic Vision-Language Models

📅 2024-10-09
🏛️ arXiv.org
📈 Citations: 2
Influential: 0
📄 PDF
🤖 AI Summary
Vision-language models struggle to capture the inherent hierarchical structure between image and text concepts. Method: This paper proposes the first compositional entailment learning framework in hyperbolic space, moving beyond conventional pairwise image-text contrastive learning. It introduces a three-level input structure (image, local object bounding box, corresponding noun) and exploits hyperbolic geometry's intrinsic suitability for modeling hierarchies to explicitly encode entailment relationships between visual parts and textual concepts. The framework integrates hyperbolic embedding, contrastive learning, textual entailment modeling, and open-vocabulary localization (e.g., GLIP), augmented with automated noun extraction and hierarchical alignment optimization. Contribution/Results: Trained on a million-scale image-text dataset, the model achieves state-of-the-art performance across zero-shot classification, cross-modal retrieval, and hierarchical reasoning tasks, outperforming both Euclidean CLIP and existing hyperbolic approaches.
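To make the entailment idea concrete: a general concept (e.g., an image) should contain its more specific parts (e.g., an object box) inside its "entailment cone" in hyperbolic space. The sketch below uses the Poincaré-ball entailment-cone formulation of Ganea et al. as a stand-in; the paper's exact hyperbolic loss may differ, and the cone constant `K` is an illustrative assumption.

```python
import numpy as np

def half_aperture(x, K=0.1):
    # Half-aperture of the entailment cone rooted at x in the Poincare
    # ball; cones widen as x moves toward the origin (more general).
    nx = np.linalg.norm(x)
    return np.arcsin(np.clip(K * (1 - nx**2) / nx, -1.0, 1.0))

def exterior_angle(x, y):
    # Angle at x between the geodesic to y and the ray from the origin
    # through x; y entails x iff this angle <= half_aperture(x).
    nx2, ny2 = np.dot(x, x), np.dot(y, y)
    dot = np.dot(x, y)
    num = dot * (1 + nx2) - nx2 * (1 + ny2)
    den = np.linalg.norm(x) * np.linalg.norm(x - y) * np.sqrt(
        1 + nx2 * ny2 - 2 * dot)
    return np.arccos(np.clip(num / den, -1.0, 1.0))

def entailment_loss(general, specific, K=0.1):
    # Hinge on the angle excess: zero if the specific embedding already
    # lies inside the general embedding's cone, positive otherwise.
    return max(0.0, exterior_angle(general, specific) - half_aperture(general, K))
```

For example, a point further from the origin along the same ray (`[0.6, 0]` relative to `[0.3, 0]`) sits inside the cone and incurs zero loss, while a point on the opposite side of the ball is penalized.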

📝 Abstract
Image-text representation learning forms a cornerstone in vision-language models, where pairs of images and textual descriptions are contrastively aligned in a shared embedding space. Since visual and textual concepts are naturally hierarchical, recent work has shown that hyperbolic space can serve as a high-potential manifold to learn vision-language representation with strong downstream performance. In this work, for the first time we show how to fully leverage the innate hierarchical nature of hyperbolic embeddings by looking beyond individual image-text pairs. We propose Compositional Entailment Learning for hyperbolic vision-language models. The idea is that an image is not only described by a sentence but is itself a composition of multiple object boxes, each with their own textual description. Such information can be obtained freely by extracting nouns from sentences and using openly available localized grounding models. We show how to hierarchically organize images, image boxes, and their textual descriptions through contrastive and entailment-based objectives. Empirical evaluation on a hyperbolic vision-language model trained with millions of image-text pairs shows that the proposed compositional learning approach outperforms conventional Euclidean CLIP learning, as well as recent hyperbolic alternatives, with better zero-shot and retrieval generalization and clearly stronger hierarchical performance.
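The contrastive side of the abstract's objective can be sketched as InfoNCE over geodesic distances on a hyperboloid. The snippet below is a minimal illustration, assuming a Lorentz-model embedding with curvature `c` and temperature `tau` (both illustrative values, not the paper's settings) and a one-directional image-to-text loss.

```python
import numpy as np

def lift_to_lorentz(v, c=1.0):
    # Map Euclidean encoder outputs v of shape (N, d) onto the Lorentz
    # hyperboloid of curvature -c: the time coordinate is chosen so
    # that the Lorentzian norm <x, x>_L equals -1/c.
    x0 = np.sqrt(1.0 / c + np.sum(v * v, axis=-1, keepdims=True))
    return np.concatenate([x0, v], axis=-1)

def lorentz_distance(x, y, c=1.0):
    # Pairwise geodesic distances between rows of x and rows of y; the
    # Lorentzian inner product negates the time (first) coordinate.
    inner = -np.outer(x[:, 0], y[:, 0]) + x[:, 1:] @ y[:, 1:].T
    return np.arccosh(np.clip(-c * inner, 1.0, None)) / np.sqrt(c)

def hyperbolic_contrastive_loss(img_feats, txt_feats, c=1.0, tau=0.1):
    # Image-to-text InfoNCE over negative distances: matching pairs sit
    # on the diagonal and should be geodesically closest.
    d = lorentz_distance(lift_to_lorentz(img_feats, c),
                         lift_to_lorentz(txt_feats, c), c)
    logits = -d / tau
    m = logits.max(axis=1, keepdims=True)
    logp = logits - m - np.log(np.exp(logits - m).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(logp))
```

With correctly paired batches the diagonal distances are small and the loss is near zero; shuffling the text rows drives it up, which is the signal the contrastive objective trains on.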
Problem

Research questions and friction points this paper is trying to address.

Leverage hyperbolic space for hierarchical vision-language representation.
Enhance image-text alignment using compositional entailment learning.
Improve zero-shot and retrieval performance in vision-language models.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hyperbolic space for hierarchical vision-language representation
Compositional Entailment Learning with image-text pairs
Contrastive and entailment-based hierarchical organization