Sample-efficient Integration of New Modalities into Large Language Models

📅 2025-09-04
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Integrating low-resource novel modalities—such as satellite/astronomical images, IMU signals, and molecular data—into large language models (LLMs) remains challenging due to prohibitive data and computational requirements. Method: We propose a hypernetwork-based few-shot modality adaptation framework. It employs a shared projector to unify heterogeneous modality embeddings into a common latent space and introduces a modality-agnostic hypernetwork that generates task-specific adapters for arbitrary-dimensional novel modalities using only 32 labeled samples. To enhance generalization, we incorporate isometric transformations during training to increase representation diversity. Contribution/Results: Our method achieves comparable performance to full modality fine-tuning on multimodal benchmarks while requiring only 1/64 the training data. It significantly lowers the barrier to modality expansion and enables zero-shot extensibility—allowing seamless integration of previously unseen modalities without retraining the base LLM.

Technology Category

Application Category

📝 Abstract
Multimodal foundation models can process several modalities. However, since the space of possible modalities is large and evolving over time, training a model from scratch to encompass all modalities is unfeasible. Moreover, integrating a modality into a pre-existing foundation model currently requires a significant amount of paired data, which is often not available for low-resource modalities. In this paper, we introduce a method for sample-efficient modality integration (SEMI) into Large Language Models (LLMs). To this end, we devise a hypernetwork that can adapt a shared projector -- placed between modality-specific encoders and an LLM -- to any modality. The hypernetwork, trained on high-resource modalities (i.e., text, speech, audio, video), is conditioned on a few samples from any arbitrary modality at inference time to generate a suitable adapter. To increase the diversity of training modalities, we artificially multiply the number of encoders through isometric transformations. We find that SEMI achieves a significant boost in sample efficiency during few-shot integration of new modalities (i.e., satellite images, astronomical images, inertial measurements, and molecules) with encoders of arbitrary embedding dimensionality. For instance, to reach the same accuracy as 32-shot SEMI, training the projector from scratch needs 64$ imes$ more data. As a result, SEMI holds promise to extend the modality coverage of foundation models.
Problem

Research questions and friction points this paper is trying to address.

Efficiently integrating new modalities into LLMs
Reducing data requirements for modality integration
Enabling few-shot learning for low-resource modalities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hypernetwork adapts shared projector to modalities
Artificial encoder multiplication via isometric transformations
Few-shot conditioning enables sample-efficient modality integration
🔎 Similar Papers
No similar papers found.