On the Cone Effect and Modality Gap in Medical Vision-Language Embeddings

📅 2026-03-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the "cone effect" in medical vision-language models, which induces a modality gap whose impact on downstream tasks remains unclear. Rather than treating the gap as a defect to be eliminated, the study formulates it as a tunable hyperparameter and introduces a lightweight post-processing mechanism. With pre-trained encoders kept frozen (CLIP, SigLIP, and their medical variants), the method enables continuous control over the gap magnitude via a single parameter λ. Experiments show that medical data are especially sensitive to modality-gap adjustments and that moderately reducing the gap substantially improves performance across multimodal tasks. These findings support dynamically tuning the modality gap rather than enforcing rigid alignment.

📝 Abstract
Vision-Language Models (VLMs) exhibit a characteristic "cone effect" in which nonlinear encoders map embeddings into highly concentrated regions of the representation space, contributing to cross-modal separation known as the modality gap. While this phenomenon has been widely observed, its practical impact on supervised multimodal learning, particularly in medical domains, remains unclear. In this work, we introduce a lightweight post-hoc mechanism that keeps pretrained VLM encoders frozen while continuously controlling cross-modal separation through a single hyperparameter λ. This enables systematic analysis of how the modality gap affects downstream multimodal performance without expensive retraining. We evaluate generalist (CLIP, SigLIP) and medically specialized (BioMedCLIP, MedSigLIP) models across diverse medical and natural datasets in supervised multimodal settings. Results consistently show that reducing an excessive modality gap improves downstream performance, with medical datasets exhibiting stronger sensitivity to gap modulation; however, fully collapsing the gap is not always optimal, and an intermediate, task-dependent separation yields the best results. These findings position the modality gap as a tunable property of multimodal representations rather than a quantity that should be universally minimized.
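The abstract describes a post-hoc mechanism that leaves the encoders frozen and modulates cross-modal separation with a single hyperparameter λ. The paper does not spell out the mechanism here, but a common post-hoc approach (following the "gap vector" view of the modality gap) shifts each modality's embeddings along the direction between the two modality centroids. The sketch below is an assumption-laden illustration of that idea, not the paper's actual method; the function name `modulate_gap` and the symmetric half-shift are invented for illustration.

```python
import numpy as np

def modulate_gap(img_emb: np.ndarray, txt_emb: np.ndarray, lam: float):
    """Shift image and text embeddings toward each other along the
    inter-centroid (gap) direction, then re-normalize.

    lam = 0 leaves the original gap intact; lam = 1 collapses the
    centroids (before re-normalization). Intermediate values give
    the continuous control over separation described in the abstract.
    """
    # Direction and magnitude of the modality gap: difference of centroids.
    gap = img_emb.mean(axis=0) - txt_emb.mean(axis=0)

    # Move each modality halfway toward the other, scaled by lam.
    img_shifted = img_emb - 0.5 * lam * gap
    txt_shifted = txt_emb + 0.5 * lam * gap

    # CLIP-style embeddings live on the unit hypersphere, so re-normalize.
    img_shifted /= np.linalg.norm(img_shifted, axis=1, keepdims=True)
    txt_shifted /= np.linalg.norm(txt_shifted, axis=1, keepdims=True)
    return img_shifted, txt_shifted
```

Because the operation touches only the output embeddings, it requires no retraining: one can sweep λ over [0, 1], measure downstream performance at each setting, and pick the task-dependent intermediate value the abstract reports as optimal.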
Problem

Research questions and friction points this paper is trying to address.

cone effect
modality gap
medical vision-language embeddings
multimodal learning
supervised multimodal performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

cone effect
modality gap
vision-language models
post-hoc alignment
medical multimodal learning