🤖 AI Summary
To address the prohibitively high computational cost of jointly optimizing unimodal models and modality-specific connectors in multimodal foundation model construction, this paper proposes Hypernetwork Model Alignment (Hyma). Hyma is the first framework to leverage hypernetworks for cross-modal alignment, unifying model selection and connector training: by exploiting the parameter-prediction capability of hypernetworks, it aligns the representation spaces of all $N \times M$ unimodal combinations within a single training pass—eliminating the need for repetitive grid-search-based training. Evaluated on multiple multimodal benchmarks, Hyma achieves performance on par with exhaustive grid search while reducing the cost of identifying optimal modality pairings by an order of magnitude. This significantly enhances both the efficiency and scalability of multimodal model development.
📝 Abstract
Foundation multi-modal models are often designed by stitching together multiple existing pretrained uni-modal models: for example, an image classifier with an autoregressive text model. This stitching is performed by training a connector module that aims to align the representation-representation or representation-input spaces of these uni-modal models. However, given the complexity of training such connectors on large-scale web-based datasets, coupled with the ever-increasing number of available pretrained uni-modal models, the task of uni-modal model selection and subsequent connector module training becomes computationally demanding. To address this under-studied critical problem, we propose Hypernetwork Model Alignment (Hyma), a novel all-in-one solution for optimal uni-modal model selection and connector training that leverages hypernetworks. Specifically, our framework utilizes the parameter prediction capability of a hypernetwork to obtain jointly trained connector modules for $N \times M$ combinations of uni-modal models. In our experiments, Hyma reduces the optimal uni-modal model pair search cost by $10\times$ (averaged across all experiments), while matching the ranking and trained connector performance obtained via grid search across a suite of diverse multi-modal benchmarks.
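The core idea — a single hypernetwork predicting connector weights for every uni-modal model pair — can be illustrated with a minimal NumPy sketch. All names, dimensions, and the MLP architecture below are illustrative assumptions, not the paper's actual implementation; the point is only that one set of hypernetwork parameters serves all $N \times M$ pairs, so training the hypernetwork once replaces training $N \times M$ separate connectors:

```python
import numpy as np

rng = np.random.default_rng(0)

N, M = 3, 4            # hypothetical counts of vision / text backbones
d_img, d_txt = 16, 8   # illustrative representation dimensions
d_emb, d_hid = 10, 32  # pair-embedding and hypernetwork hidden sizes

# One learnable embedding identifying each (vision model, text model) pair.
pair_emb = rng.normal(size=(N * M, d_emb))

# Hypernetwork: a small MLP mapping a pair embedding to the flattened
# weights of a linear connector (d_img -> d_txt). These two matrices are
# the only trained parameters shared across all N*M pairs.
W1 = 0.1 * rng.normal(size=(d_emb, d_hid))
W2 = 0.1 * rng.normal(size=(d_hid, d_img * d_txt))

def predict_connector(pair_id: int) -> np.ndarray:
    """Predict the connector weight matrix for one uni-modal model pair."""
    h = np.tanh(pair_emb[pair_id] @ W1)
    return (h @ W2).reshape(d_img, d_txt)

# The same hypernetwork yields a connector for every pair in one pass.
img_feat = rng.normal(size=(1, d_img))  # dummy image representation
for pid in range(N * M):
    txt_aligned = img_feat @ predict_connector(pid)
    assert txt_aligned.shape == (1, d_txt)
```

In the actual method the hypernetwork and pair embeddings would be optimized jointly with an alignment loss over web-scale data; the sketch omits training entirely and only shows the parameter-prediction mechanism.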