The Geometry of Compromise: Unlocking Generative Capabilities via Controllable Modality Alignment

📅 2026-03-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the modality gap between image and text embeddings in vision-language models, which hinders performance in cross-modal tasks such as text-to-image generation and joint clustering. The authors introduce a geometric analysis that decouples this gap into centroid and distribution gaps, revealing the latter as the primary bottleneck affecting task quality. To mitigate this issue, they propose TPC-CMA, a three-phase curriculum fine-tuning framework augmented with a gradient-aware scheduling strategy, which progressively aligns both components in a controlled manner. Experimental results demonstrate that at α=0.5, the overall modality gap is reduced by 82.3%, the Adjusted Rand Index (ARI) for clustering improves from 0.318 to 0.516, and the CIDEr score for captioning increases by 57.1%, with only marginal degradation in downstream task accuracy.
📝 Abstract
Vision-Language Models (VLMs) such as CLIP learn a shared embedding space for images and text, yet their representations remain geometrically separated, a phenomenon known as the modality gap. This gap limits tasks requiring cross-modal interchangeability, such as captioning and joint clustering. Existing post-processing approaches can partially improve cross-modal compatibility; however, we show through geometric analysis that they primarily reduce the global centroid offset while leaving the underlying distributional mismatch intact. We decompose the modality gap into a Centroid Gap and a Distribution Gap, and demonstrate that the Distribution Gap is the true predictor of cross-modal task quality ($R^2 = 0.986$), whereas the commonly used Raw Gap is misleading ($R^2 = 0.691$). Motivated by this observation, we propose TPC-CMA (Three-Phase Curriculum for Cross-Modal Alignment), a fine-tuning framework that explicitly reduces both components. The proposed CMA jointly mitigates centroid offsets and reshapes the distributional structure, while a three-phase curriculum with gradient-aware scheduling progressively introduces alignment during training to enable stable optimization. Experiments demonstrate that our method significantly improves cross-modal alignment. With $\alpha_{\text{target}}{=}0.05$, the modality gap is reduced by 66.6\% with only a 4.84\% accuracy drop. Under stronger alignment ($\alpha_{\text{target}}{=}0.5$), the gap is reduced by 82.3\%, clustering ARI improves from 0.318 to 0.516, and captioning CIDEr increases by 57.1\% over the original model. Our code and pre-trained models will be made publicly available upon acceptance.
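The centroid/distribution decomposition described in the abstract can be illustrated with a minimal sketch. Here the centroid gap is taken as the Euclidean distance between the modality means, and the distribution gap as the Frobenius distance between the centered covariance matrices; these are illustrative definitions for intuition only, and the paper's exact formulation may differ.

```python
import numpy as np

def modality_gap_decomposition(img_emb, txt_emb):
    """Split the modality gap into a centroid gap and a distribution gap.

    Illustrative definitions (not necessarily the paper's):
    - centroid gap: Euclidean distance between modality means
    - distribution gap: Frobenius distance between centered covariances
    """
    mu_i, mu_t = img_emb.mean(axis=0), txt_emb.mean(axis=0)
    centroid_gap = np.linalg.norm(mu_i - mu_t)
    cov_i = np.cov(img_emb - mu_i, rowvar=False)
    cov_t = np.cov(txt_emb - mu_t, rowvar=False)
    distribution_gap = np.linalg.norm(cov_i - cov_t, ord="fro")
    return centroid_gap, distribution_gap

# Toy check: a pure shift of identically distributed embeddings yields a
# large centroid gap but a (near-)zero distribution gap, mirroring the
# paper's point that centroid offset alone can hide distributional match.
rng = np.random.default_rng(0)
base = rng.normal(size=(1000, 8))
shifted = base + 1.0  # shift every dimension by 1, same shape otherwise
cg, dg = modality_gap_decomposition(shifted, base)
```

In this toy case `cg` equals the norm of the shift vector (sqrt(8)) while `dg` is essentially zero, which is why a raw gap measure dominated by the centroid term can be misleading.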
Problem

Research questions and friction points this paper is trying to address.

modality gap
vision-language models
cross-modal alignment
distributional mismatch
centroid offset
Innovation

Methods, ideas, or system contributions that make the work stand out.

modality gap
distribution alignment
cross-modal representation
curriculum learning
vision-language models
Hongyuan Liu
Stevens Institute of Technology
Parallel Computing, Computer Architecture, GPUs
Qinli Yang
University of Electronic Science and Technology of China
Wen Li
University of Bristol
Zhong Zhang
Tsinghua University
Large Language Models, LLM Agents, Natural Language Processing
Jiaming Liu
University of Electronic Science and Technology of China
Wei Han
University of Electronic Science and Technology of China
Zhili Qin
University of Electronic Science and Technology of China
Jinxia Guo
University of Electronic Science and Technology of China
Junming Shao
Professor of Computer Science, University of Electronic Science and Technology of China
Data Mining, Machine Learning, Neuroimaging