🤖 AI Summary
Integrating low-resource novel modalities—such as satellite/astronomical images, IMU signals, and molecular data—into large language models (LLMs) remains challenging due to prohibitive data and computational requirements.
Method: We propose a hypernetwork-based few-shot modality adaptation framework. It employs a shared projector to unify heterogeneous modality embeddings into a common latent space and introduces a modality-agnostic hypernetwork that generates task-specific adapters for arbitrary-dimensional novel modalities using only 32 labeled samples. To enhance generalization, we incorporate isometric transformations during training to increase representation diversity.
Contribution/Results: Our method achieves performance comparable to full modality fine-tuning on multimodal benchmarks while requiring only 1/64 of the training data. It significantly lowers the barrier to modality expansion and enables few-shot extensibility, allowing previously unseen modalities to be integrated without retraining the base LLM.
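To make the mechanism above concrete, here is a minimal, hypothetical sketch of the core idea: a hypernetwork conditioned on a small support set (e.g., 32 samples) emits the weights of an adapter that maps a shared latent space into the LLM's embedding space. All dimensions, the random projection used to reach the shared space, and the class/function names are illustrative assumptions, not the paper's actual architecture or trained parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy dimensions (not from the paper): shared latent width,
# LLM embedding width, and number of few-shot support samples.
D_SHARED, D_LLM, K = 64, 128, 32

def project_to_shared(emb, d_shared=D_SHARED):
    # Illustrative stand-in for the shared projector: a fixed random map
    # from an encoder's arbitrary width to the shared latent space.
    rng_p = np.random.default_rng(emb.shape[1])  # deterministic per input width
    W = rng_p.standard_normal((emb.shape[1], d_shared)) / np.sqrt(emb.shape[1])
    return emb @ W

class HyperNet:
    """Toy hypernetwork: maps a support-set summary to adapter weights."""

    def __init__(self, d_shared=D_SHARED, d_llm=D_LLM):
        # Untrained random weights, purely for shape illustration.
        self.W = rng.standard_normal((d_shared, d_shared * d_llm)) * 0.01
        self.d_shared, self.d_llm = d_shared, d_llm

    def generate_adapter(self, support_emb):
        # Condition on the mean of the projected support set (e.g., 32 shots),
        # then emit a full adapter weight matrix in one forward pass.
        ctx = project_to_shared(support_emb).mean(axis=0)
        return (ctx @ self.W).reshape(self.d_shared, self.d_llm)

# A "novel modality" encoder with an arbitrary width (e.g., 97-dim IMU features).
support = rng.standard_normal((K, 97))
adapter = HyperNet().generate_adapter(support)
print(adapter.shape)  # (64, 128)
```

The point of the sketch is the data flow: no gradient step touches the LLM; the adapter for a new modality is produced directly from a handful of its embeddings.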
📝 Abstract
Multimodal foundation models can process several modalities. However, since the space of possible modalities is large and evolving over time, training a model from scratch to encompass all modalities is infeasible. Moreover, integrating a modality into a pre-existing foundation model currently requires a significant amount of paired data, which is often not available for low-resource modalities. In this paper, we introduce a method for sample-efficient modality integration (SEMI) into Large Language Models (LLMs). To this end, we devise a hypernetwork that can adapt a shared projector -- placed between modality-specific encoders and an LLM -- to any modality. The hypernetwork, trained on high-resource modalities (i.e., text, speech, audio, video), is conditioned on a few samples from any arbitrary modality at inference time to generate a suitable adapter. To increase the diversity of training modalities, we artificially multiply the number of encoders through isometric transformations. We find that SEMI achieves a significant boost in sample efficiency during few-shot integration of new modalities (i.e., satellite images, astronomical images, inertial measurements, and molecules) with encoders of arbitrary embedding dimensionality. For instance, to reach the same accuracy as 32-shot SEMI, training the projector from scratch needs 64$\times$ more data. As a result, SEMI holds promise to extend the modality coverage of foundation models.
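The isometric-transformation trick mentioned in the abstract can be illustrated with a short sketch: applying a random orthogonal matrix to an encoder's embeddings preserves norms and pairwise distances (the geometry the projector must learn to read) while changing the raw coordinates, so one encoder yields many synthetic "training modalities". The QR-based construction below is one standard way to sample an orthogonal matrix; it is an assumption for illustration, not the paper's exact procedure.

```python
import numpy as np

rng = np.random.default_rng(1)

def random_isometry(d, rng):
    # QR decomposition of a Gaussian matrix yields an orthogonal Q,
    # i.e., a linear isometry of R^d.
    q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    return q

emb = rng.standard_normal((5, 16))   # embeddings from one "encoder"
Q = random_isometry(16, rng)
emb_aug = emb @ Q                    # looks like a different encoder

# Isometry check: pairwise geometry is unchanged, so semantic structure
# survives even though the coordinates are entirely different.
orig_d = np.linalg.norm(emb[0] - emb[1])
aug_d = np.linalg.norm(emb_aug[0] - emb_aug[1])
print(np.isclose(orig_d, aug_d))  # True
```

Because each sampled `Q` is a new isometry, the same encoder can be reused many times with distinct coordinate systems, multiplying the effective number of training modalities at negligible cost.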