SMA: Submodular Modality Aligner For Data Efficient Multimodal Learning

📅 2026-05-12

🤖 AI Summary

This work addresses the challenges multimodal foundation models face in low-data and rare-scenario settings, where aligned data is scarce and the modality gap is pronounced. Conventional instance-level alignment methods learn from individual image-text pairs and overlook the geometric structure shared across modalities. To overcome this limitation, the paper proposes a combinatorial multimodal alignment paradigm that models multiple augmentations and descriptions of an entity as a set, moving beyond pairwise learning. The approach introduces an objective based on submodular mutual information (SMI) that jointly maximizes inter-modality mutual information while reducing cross-modal divergence. Using only tens of thousands of samples, orders of magnitude fewer than standard approaches, the method shows consistent gains in the low-data regime across 14 zero-shot classification and retrieval tasks from the CLIP benchmark.

📝 Abstract

Despite the recent success of Multimodal Foundation Models (FMs), their reliance on massive paired datasets limits their applicability in low-data and rare-scenario settings where aligned data is scarce and expensive. A key bottleneck is the adoption of an instance-level formulation, which learns alignment by maximizing correlation between individual image-text pairs while neglecting the underlying geometric structure across modalities, resulting in a modality gap. In this paper, we propose a combinatorial paradigm for multimodal alignment that moves beyond pairwise learning and introduce the Submodular Modality Aligner (SMA), which treats multiple augmentations and descriptions of an entity as a set to capture richer cross-modal structure. We instantiate SMA using a principled objective based on Submodular Mutual Information (SMI), which jointly maximizes inter-modality mutual information while reducing cross-modal divergence. This formulation enables the model to effectively utilize multiple positive associations and extract significantly more information from limited data. We evaluate SMA on 14 zero-shot classification and retrieval tasks from the CLIP benchmark and demonstrate consistent gains in the low-data regime. Notably, SMA achieves strong multimodal generalization using only tens of thousands of samples, orders of magnitude fewer than standard approaches. Our results highlight the importance of set-based formulations and submodular objectives for data-efficient multimodal learning.
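
To make the set-based idea concrete, here is a minimal sketch of how a submodular mutual information score between a set of image embeddings and a set of text embeddings for one entity might be computed. The abstract does not state which SMI instantiation SMA uses, so this sketch assumes a facility-location variant from the submodular optimization literature; the function names `l2_normalize` and `flqmi`, the weight `eta`, and the toy embeddings are all hypothetical and are not taken from the paper.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def flqmi(image_set, text_set, eta=1.0):
    """Facility-location-style mutual information between two embedding sets.

    image_set: (m, d) array of image-view embeddings for one entity.
    text_set:  (n, d) array of text-description embeddings for the same entity.
    eta:       hypothetical weight on the image-to-text coverage term.

    Assumption: this is an illustrative SMI instantiation, not necessarily
    the objective used by SMA.
    """
    sim = l2_normalize(image_set) @ l2_normalize(text_set).T  # shape (m, n)
    text_coverage = sim.max(axis=0).sum()   # best-matching image per text
    image_coverage = sim.max(axis=1).sum()  # best-matching text per image
    return text_coverage + eta * image_coverage


# Toy 2-D embeddings: two image views and two descriptions of one entity.
images = np.array([[1.0, 0.0], [0.0, 1.0]])
aligned_texts = np.array([[1.0, 0.0], [0.0, 1.0]])
misaligned_texts = np.array([[-1.0, 0.0], [0.0, -1.0]])

aligned_score = flqmi(images, aligned_texts)
misaligned_score = flqmi(images, misaligned_texts)
```

In this toy case the score is higher when every description has a closely matching image view and vice versa, which is the intuition behind rewarding alignment between whole sets rather than between single pairs. A training loss would typically maximize such a score for sets belonging to the same entity, though the exact form used in the paper is not given here.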

Problem

Research questions and friction points this paper is trying to address.

Multimodal Learning

Data Efficiency

Modality Gap

Low-data Regime

Cross-modal Alignment

Innovation

Methods, ideas, or system contributions that make the work stand out.

Submodular Mutual Information

Multimodal Alignment

Data-Efficient Learning