HiMo-CLIP: Modeling Semantic Hierarchy and Monotonicity in Vision-Language Alignment

📅 2025-11-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing contrastive vision-language models (e.g., CLIP) treat text as flat token sequences, failing to capture semantic hierarchy and monotonicity, which limits cross-modal alignment for long or compositional descriptions. To address this, we propose HiMo-CLIP, which combines HiDe, a hierarchical decomposition module, with MoLo, a monotonicity-aware contrastive loss. Without modifying encoder architectures, our approach enables batch-aware, multi-granularity semantic alignment and, for the first time, models alignment strength as a function of textual completeness. Built upon CLIP, HiDe implicitly extracts hierarchical semantics via in-batch PCA, while MoLo enforces monotonic alignment constraints; a global–component joint alignment strategy further enhances fine-grained correspondence. Extensive experiments on multiple image–text retrieval benchmarks demonstrate significant improvements over strong baselines, especially in long-text and complex-description scenarios, and ablations confirm that explicitly modeling semantic hierarchy and monotonicity substantially enhances vision–language alignment.

📝 Abstract
Contrastive vision-language models like CLIP have achieved impressive results in image-text retrieval by aligning image and text representations in a shared embedding space. However, these models often treat text as flat sequences, limiting their ability to handle complex, compositional, and long-form descriptions. In particular, they fail to capture two essential properties of language: semantic hierarchy, which reflects the multi-level compositional structure of text, and semantic monotonicity, where richer descriptions should result in stronger alignment with visual content.

To address these limitations, we propose HiMo-CLIP, a representation-level framework that enhances CLIP-style models without modifying the encoder architecture. HiMo-CLIP introduces two key components: a hierarchical decomposition (HiDe) module that extracts latent semantic components from long-form text via in-batch PCA, enabling flexible, batch-aware alignment across different semantic granularities, and a monotonicity-aware contrastive loss (MoLo) that jointly aligns global and component-level representations, encouraging the model to internalize semantic ordering and alignment strength as a function of textual completeness.

These components work in concert to produce structured, cognitively-aligned cross-modal representations. Experiments on multiple image-text retrieval benchmarks show that HiMo-CLIP consistently outperforms strong baselines, particularly under long or compositional descriptions. The code is available at https://github.com/UnicomAI/HiMo-CLIP.
Problem

Research questions and friction points this paper is trying to address.

Enhances CLIP to capture semantic hierarchy in vision-language alignment
Addresses failure to model semantic monotonicity in text descriptions
Improves handling of complex compositional and long-form textual inputs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical decomposition module extracts latent semantic components
Monotonicity-aware contrastive loss aligns multi-level representations
Framework enhances CLIP models without modifying encoder architecture
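The two contributions above can be sketched in a few lines of NumPy. This is a minimal illustration reconstructed only from the summary: the function names, the hinge-style monotonicity term, and all hyperparameters (`k`, `tau`, `margin`) are our assumptions, not the paper's actual formulation.

```python
import numpy as np

def hide_components(text_emb, k=2):
    """HiDe sketch: in-batch PCA over a batch of text embeddings.

    Extracts k latent semantic directions from the batch and projects each
    embedding onto them, yielding component-level representations.
    text_emb: (n, d) array; returns (n, k, d).
    """
    mu = text_emb.mean(axis=0, keepdims=True)
    centered = text_emb - mu
    # SVD of the centered batch; rows of vt are principal directions.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    directions = vt[:k]                        # (k, d)
    coords = centered @ directions.T           # (n, k) projection coords
    # Reconstruct each component in embedding space, restoring the mean.
    return coords[:, :, None] * directions[None, :, :] + mu[None]

def molo_loss(img_emb, text_emb, components, tau=0.07, margin=0.05):
    """MoLo sketch: InfoNCE on the global image-text pair, plus a hinge
    encouraging the full text to align with its image at least as strongly
    as any partial component (one plausible reading of monotonicity)."""
    def norm(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)
    img, txt = norm(img_emb), norm(text_emb)
    logits = img @ txt.T / tau                 # (n, n) similarity logits
    labels = np.arange(len(img))
    # Standard image-to-text InfoNCE term.
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    nce = -log_probs[labels, labels].mean()
    # Monotonicity hinge: full-text similarity should exceed each
    # component's similarity by at least `margin`.
    comp = norm(components)                    # (n, k, d)
    full_sim = (img * txt).sum(-1)             # (n,)
    comp_sim = np.einsum('nd,nkd->nk', img, comp)
    mono = np.maximum(0.0, comp_sim - full_sim[:, None] + margin).mean()
    return nce + mono
```

In this reading, richer (more complete) text should never score lower than any of its extracted sub-components, which is the "alignment strength as a function of textual completeness" property the abstract describes.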
👥 Authors
Ruijia Wu, Data Science & Artificial Intelligence Research Institute, China Unicom
Ping Chen, Data Science & Artificial Intelligence Research Institute, China Unicom
Fei Shen, National University of Singapore (Controllable Generation; Multimodal Safety)
Shaoan Zhao, Data Science & Artificial Intelligence Research Institute, China Unicom
Qiang Hui, Data Science & Artificial Intelligence Research Institute, China Unicom
Huanlin Gao, Data Science & Artificial Intelligence Research Institute, China Unicom
Ting Lu, Hunan University (remote sensing image processing and analysis)
Zhaoxiang Liu, China Unicom (Computer Vision; Deep Learning; Robotics; Human-Computer Interaction)
Fang Zhao, Data Science & Artificial Intelligence Research Institute, China Unicom
Kai Wang, Data Science & Artificial Intelligence Research Institute, China Unicom
Shiguo Lian, CloudMinds