ARGENT: Adaptive Hierarchical Image-Text Representations

📅 2026-03-24
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses two problems. First, vision-language models (VLMs) that operate in Euclidean space fail to capture hierarchical relationships between visual and linguistic concepts. Second, existing hyperbolic VLMs suffer from entailment cone collapse and from unreliable hierarchical evaluation. The authors learn hierarchical image-text representations in hyperbolic space with an adaptive entailment loss and a norm regularizer, which together prevent cone collapse without heuristic aperture clipping. They also introduce an angle-based probabilistic entailment protocol (PEP) that scores hierarchical understanding with AUC-ROC and Average Precision. The resulting model, ARGENT, improves on the state-of-the-art hyperbolic VLM by 0.7, 1.1, and 0.8 absolute points on image classification, text-to-image retrieval, and the proposed hierarchical metrics, respectively.
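
As a rough illustration of the two metrics named above, the sketch below computes AUC-ROC and Average Precision from a list of entailment scores and binary labels. The scores and labels are made-up values, and the function names are hypothetical; how PEP maps angles to probabilities is not described on this page, so the code only shows how such scores would be evaluated once they exist.

```python
def auc_roc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(scores, labels):
    """Mean of precision@k over the ranks k at which a positive appears."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for k, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / k
    return total / hits


# Hypothetical entailment scores for five (parent, child) candidate pairs;
# label 1 means the pair is a true entailment, 0 means it is a negative.
scores = [0.9, 0.8, 0.7, 0.4, 0.3]
labels = [1, 0, 1, 1, 0]

print("AUC-ROC:", auc_roc(scores, labels))
print("Average Precision:", average_precision(scores, labels))
```

Both metrics depend only on how the scores rank positives against negatives, so they need no fixed decision threshold.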

📝 Abstract
Large-scale Vision-Language Models (VLMs) such as CLIP learn powerful semantic representations but operate in Euclidean space, which fails to capture the inherent hierarchical structure of visual and linguistic concepts. Hyperbolic geometry, with its exponential volume growth, offers a principled alternative for embedding such hierarchies with low distortion. However, existing hyperbolic VLMs use entailment losses that are unstable: as parent embeddings contract toward the origin, their entailment cones widen toward a half-space, causing catastrophic cone collapse that destroys the intended hierarchy. Additionally, hierarchical evaluation of these models remains unreliable: it relies largely on retrieval-based and correlation-based metrics, which are prone to taxonomy dependence and ambiguous negatives. To address these limitations, we propose an adaptive entailment loss paired with a norm regularizer that prevents cone collapse without heuristic aperture clipping. We further introduce an angle-based probabilistic entailment protocol (PEP) for evaluating hierarchical understanding, scored with AUC-ROC and Average Precision. This paper introduces a stronger hyperbolic VLM baseline, ARGENT (Adaptive hieRarchical imaGe-tExt represeNTation). ARGENT improves the SOTA hyperbolic VLM by 0.7, 1.1, and 0.8 absolute points on image classification, text-to-image retrieval, and the proposed hierarchical metrics, respectively.
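
The cone-widening behavior described in the abstract can be seen in a small numeric sketch. It assumes the half-aperture formula used in earlier hyperbolic entailment-cone work, asin(2K / (sqrt(c) * ||x||)), with ||x|| the norm of the embedding's space components, c the curvature magnitude, and K a boundary constant. The formula, the value K = 0.1, and the function name `half_aperture` are assumptions for illustration; the abstract does not state the exact form ARGENT uses.

```python
import math


def half_aperture(norm, curvature=1.0, k=0.1):
    """Half-aperture (radians) of the entailment cone at an embedding with the given norm.

    Assumed formula: asin(2k / (sqrt(curvature) * norm)).
    The min() clamp only keeps asin defined in this sketch.
    """
    ratio = 2.0 * k / (math.sqrt(curvature) * norm)
    return math.asin(min(1.0, ratio))


# As the parent embedding moves toward the origin, its cone opens up.
for norm in (2.0, 1.0, 0.5, 0.25, 0.2):
    degrees = math.degrees(half_aperture(norm))
    print(f"norm={norm:<5} half-aperture={degrees:.1f} deg")
```

Under this assumed formula the half-aperture grows monotonically as the norm shrinks and reaches 90 degrees, a half-space, once the norm falls to 2K / sqrt(c). At that point nearly every embedding lies inside the cone and the entailment constraint no longer separates children from non-children.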
Problem

Research questions and friction points this paper is trying to address.

hyperbolic geometry
vision-language models
hierarchical representation
entailment collapse
hierarchical evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

hyperbolic geometry
adaptive entailment loss
cone collapse prevention
probabilistic entailment protocol
hierarchical vision-language modeling
Chuong Huynh
University of Maryland, College Park
Computer Vision, Deep Learning, Vision Language Model, Image Retrieval
Hossein Souri
Senior Researcher, Samsung AI Center
Computer Vision, Vision and Language, Multimodal Models, Security
Abhinav Kumar
Samsung Research America, AI Center – Mountain View
Vitali Petsiuk
PhD student, Boston University
explainable AI, deep learning, computer vision, machine learning
Deen Dayal Mohan
Samsung Research America, AI Center – Mountain View
Suren Kumar
Samsung Research America, AI Center – Mountain View