🤖 AI Summary
Existing multimodal alignment methods rely heavily on large-scale paired data (often millions of samples), yet such abundant supervision is rarely available in real-world scenarios. To address this scarcity, this paper introduces a few-shot cross-modal alignment paradigm that requires only tens of thousands of paired samples (less than 1% of conventional scale) to jointly align pre-trained unimodal encoders. The key contributions are: (1) STRUCTURE regularization, which preserves intra-modal semantic relationships by constraining the neighborhood geometry of each encoder's latent space; and (2) an empirical finding, validated across multiple benchmarks, that representational similarity across modalities often peaks at intermediate rather than final layers, motivating a layer-wise alignment objective that targets the most similar layers. Evaluated on 24 zero-shot image classification and cross-modal retrieval benchmarks, the method achieves average relative improvements of 51.6% (classification) and 91.8% (retrieval), substantially advancing the state of few-shot multimodal modeling.
📝 Abstract
Multimodal models have demonstrated powerful capabilities in complex tasks requiring multimodal alignment, including zero-shot classification and cross-modal retrieval. However, existing models typically rely on millions of paired multimodal samples, which are prohibitively expensive or infeasible to obtain in many domains. In this work, we explore the feasibility of building multimodal models with a limited amount of paired data by aligning pretrained unimodal foundation models. We show that high-quality alignment is possible with as few as tens of thousands of paired samples – less than 1% of the data typically used in the field. To achieve this, we introduce STRUCTURE, an effective regularization technique that preserves the neighborhood geometry of the unimodal encoders' latent spaces. Additionally, we show that aligning only the last layers is often suboptimal and demonstrate the benefits of aligning the layers with the highest representational similarity across modalities. These two components can be readily incorporated into existing alignment methods, yielding substantial gains across 24 zero-shot image classification and retrieval benchmarks, with average relative improvements of 51.6% in classification and 91.8% in retrieval. Our results highlight the effectiveness and broad applicability of our framework for limited-sample multimodal learning and offer a promising path forward for resource-constrained domains.
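The abstract names two components without giving formulas. As a rough illustrative sketch only (not the paper's actual implementation), a neighborhood-geometry regularizer in the spirit of STRUCTURE can be written as a penalty on how much alignment distorts cosine-similarity neighborhoods of the frozen pretrained embeddings, and "representational similarity across modalities" can be measured per layer with linear CKA, a common choice we assume here; all function names and the choice of k nearest neighbors are hypothetical:

```python
import numpy as np

def _cosine_sim(z):
    """Pairwise cosine-similarity matrix for row-vector embeddings z (n, d)."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    return z @ z.T

def structure_regularizer(z_pre, z_post, k=3):
    """Hypothetical STRUCTURE-style penalty: keep each sample's k-nearest-neighbor
    similarity structure (computed in the frozen pretrained space z_pre) intact
    after the alignment head maps it to z_post."""
    s_pre = _cosine_sim(z_pre)    # geometry before alignment (frozen encoder)
    s_post = _cosine_sim(z_post)  # geometry after the trainable alignment head
    # k nearest neighbors (plus self) of each sample in the pretrained space.
    idx = np.argsort(-s_pre, axis=1)[:, : k + 1]
    mask = np.zeros_like(s_pre)
    np.put_along_axis(mask, idx, 1.0, axis=1)
    # Mean squared similarity distortion restricted to those neighborhoods.
    return float(((s_pre - s_post) ** 2 * mask).sum() / mask.sum())

def linear_cka(x, y):
    """Linear CKA between activations x (n, d1) and y (n, d2) of two layers,
    one standard way to score cross-modal representational similarity."""
    x = x - x.mean(0, keepdims=True)
    y = y - y.mean(0, keepdims=True)
    return float(np.linalg.norm(x.T @ y) ** 2
                 / (np.linalg.norm(x.T @ x) * np.linalg.norm(y.T @ y)))
```

Under this sketch, one would add `structure_regularizer` to the contrastive alignment loss and, before training, pick the image/text layer pair with the highest `linear_cka` score as the alignment target instead of defaulting to the final layers.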