🤖 AI Summary
Contrastive vision-language models (e.g., CLIP) excel at global semantic alignment but lack fine-grained region–phrase alignment capability; existing enhancements often sacrifice global consistency to improve local perception. This paper proposes a multi-granularity collaborative alignment framework that preserves CLIP's global contrastive loss while introducing explicit region–phrase alignment supervision and a fine-grained semantic alignment loss. The framework enables direct matching between image regions and textual phrases, jointly modeling semantic spaces at the global, regional, and phrase levels. By harmonizing these alignment objectives, the method overcomes the inherent trade-off between global and local performance. Experiments demonstrate substantial improvements: up to a 69.78% relative gain in cross-modal retrieval accuracy and a 3.2% absolute increase in bounding-box classification Top-1 accuracy, significantly outperforming state-of-the-art approaches.
📝 Abstract
Contrastive Language-Image Pre-training (CLIP) has demonstrated remarkable generalization ability and strong performance across a wide range of vision-language tasks. However, due to the lack of region-level supervision, CLIP exhibits limited fine-grained semantic understanding. Although several methods attempt to mitigate this issue, they unintentionally disrupt the global alignment, resulting in a persistent trade-off in which improving local perception degrades global coherence. In this paper, we propose HarmoCLIP, a novel framework designed to harmonize global and region representations within CLIP. We first identify that the absence of direct alignment between local textual and visual semantics is the fundamental cause of the trade-off. To address this, HarmoCLIP introduces an explicit fine-grained semantic supervision term that directly aligns textual segments with their corresponding visual regions, effectively bridging the image region space and the textual space. To further strengthen representation capability at the local level, our method introduces a novel Region-Language Alignment supervision strategy that promotes fine-grained semantic learning without compromising global semantic consistency. Extensive experiments demonstrate that HarmoCLIP achieves state-of-the-art performance on the global task of retrieval, with improvements of up to 69.78%, and yields a substantial 3.2% gain in Top-1 accuracy on the region task of bounding-box classification. It consistently outperforms prior approaches while providing a balanced, efficient, and plug-and-play solution to the global-local trade-off in CLIP. Code is available at https://github.com/Erosist/HarmoCLIP.
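The abstract describes combining CLIP's global image-text contrastive objective with an additional region-phrase alignment term so that neither dominates. The following is a minimal, dependency-free sketch of that general idea: a symmetric InfoNCE loss applied once to global embeddings and once to region/phrase embeddings, combined with a weighting factor. The function names (`harmonized_loss`), the weight `lam`, and the specific loss form are illustrative assumptions, not the paper's released implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)


def infonce(queries, keys, temperature=0.07):
    """InfoNCE loss: queries[i] is the positive match for keys[i];
    all other keys in the batch act as negatives."""
    total = 0.0
    for i, q in enumerate(queries):
        sims = [cosine(q, k) / temperature for k in keys]
        # log-sum-exp with max-subtraction for numerical stability
        m = max(sims)
        log_denom = m + math.log(sum(math.exp(s - m) for s in sims))
        total += -(sims[i] - log_denom)
    return total / len(queries)


def harmonized_loss(img_emb, txt_emb, region_emb, phrase_emb, lam=0.5):
    """Global image-text contrastive loss plus a weighted fine-grained
    region-phrase alignment term (illustrative combination, not the
    paper's exact objective)."""
    l_global = 0.5 * (infonce(img_emb, txt_emb) + infonce(txt_emb, img_emb))
    l_region = 0.5 * (infonce(region_emb, phrase_emb)
                      + infonce(phrase_emb, region_emb))
    return l_global + lam * l_region
```

With this formulation, well-aligned pairs yield a lower loss than mismatched ones at both granularities, which is the property the paper's framework is designed to enforce jointly rather than trading one level off against the other.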