(Almost) Free Modality Stitching of Foundation Models

📅 2025-07-14
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the prohibitively high computational cost of jointly optimizing unimodal models and modality-specific connectors in multimodal foundation model construction, this paper proposes Hypernetwork Model Alignment (Hyma). Hyma is the first framework to leverage hypernetworks for cross-modal alignment, unifying model selection and connector training: by exploiting the parameter-prediction capability of hypernetworks, it aligns the representation spaces of all $N \times M$ unimodal combinations within a single training pass, eliminating the need for repetitive grid-search-based training. Evaluated on multiple multimodal benchmarks, Hyma achieves performance on par with exhaustive grid search while reducing the cost of identifying optimal modality pairings by an order of magnitude. This significantly enhances both the efficiency and scalability of multimodal model development.
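The core idea described above can be sketched in a few lines: a hypernetwork conditioned on an embedding of a (unimodal-pair) index emits the weights of that pair's connector, so all $N \times M$ connectors share one trainable model. This is a minimal illustrative sketch, not the paper's actual architecture; all dimensions, the shared connector shape, and the class/variable names (`ConnectorHypernet`, `pair_emb`, etc.) are assumptions for illustration.

```python
import torch
import torch.nn as nn

# Hypothetical setup: N image encoders x M text models, all assumed (for
# simplicity) to share the same connector input/output dimensions.
N_IMG, N_TXT = 3, 4
IMG_DIM, TXT_DIM = 128, 256
EMB = 64

class ConnectorHypernet(nn.Module):
    """Given an embedding of an (image-model, text-model) pair, predict the
    weights and bias of a linear connector mapping IMG_DIM -> TXT_DIM."""
    def __init__(self):
        super().__init__()
        self.pair_emb = nn.Embedding(N_IMG * N_TXT, EMB)
        self.to_weight = nn.Linear(EMB, IMG_DIM * TXT_DIM)
        self.to_bias = nn.Linear(EMB, TXT_DIM)

    def forward(self, i: int, j: int, x: torch.Tensor) -> torch.Tensor:
        # One flat index per unimodal pair; the hypernetwork's parameters
        # are shared across all N x M pairs and trained in a single pass.
        e = self.pair_emb(torch.tensor(i * N_TXT + j))
        W = self.to_weight(e).view(TXT_DIM, IMG_DIM)
        b = self.to_bias(e)
        return x @ W.T + b  # apply the predicted connector

hyper = ConnectorHypernet()
feats = torch.randn(2, IMG_DIM)   # stand-in for frozen image-encoder outputs
out = hyper(1, 2, feats)          # connector predicted for pair (1, 2)
print(out.shape)                  # torch.Size([2, 256])
```

Training would backpropagate an alignment loss through the predicted weights into the hypernetwork, so a single optimization run covers every model combination instead of one grid-search run per pair.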

📝 Abstract
Foundation multi-modal models are often designed by stitching together multiple existing pretrained uni-modal models: for example, an image classifier with an autoregressive text model. This stitching process is performed by training a connector module that aims to align the representation-representation or representation-input spaces of these uni-modal models. However, given the complexity of training such connectors on large scale web-based datasets coupled with the ever-increasing number of available pretrained uni-modal models, the task of uni-modal model selection and subsequent connector module training becomes computationally demanding. To address this under-studied critical problem, we propose Hypernetwork Model Alignment (Hyma), a novel all-in-one solution for optimal uni-modal model selection and connector training by leveraging hypernetworks. Specifically, our framework utilizes the parameter prediction capability of a hypernetwork to obtain jointly trained connector modules for $N \times M$ combinations of uni-modal models. In our experiments, Hyma reduces the optimal uni-modal model pair search cost by $10\times$ (averaged across all experiments), while matching the ranking and trained connector performance obtained via grid search across a suite of diverse multi-modal benchmarks.
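The "connector module" the abstract describes is, in the simplest case, a small trainable network placed between two frozen unimodal models. A minimal sketch, assuming a frozen image encoder and a frozen text model with hypothetical feature dimensions (the `Connector` class and all sizes here are illustrative, not the paper's configuration):

```python
import torch
import torch.nn as nn

# Hypothetical feature dimensions of a frozen image encoder and a frozen
# text model; only the connector's parameters would be trained.
IMG_DIM, TXT_DIM = 768, 1024

class Connector(nn.Module):
    """Small MLP mapping image-encoder outputs into the text model's
    representation (or input-embedding) space."""
    def __init__(self, in_dim: int, out_dim: int, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

connector = Connector(IMG_DIM, TXT_DIM)
img_features = torch.randn(4, IMG_DIM)  # stand-in for frozen encoder outputs
aligned = connector(img_features)       # now lives in the text model's space
print(aligned.shape)                    # torch.Size([4, 1024])
```

Grid search would train one such connector per (image model, text model) pair; with $N$ image models and $M$ text models that is $N \times M$ separate training runs, which is the cost Hyma amortizes.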
Problem

Research questions and friction points this paper is trying to address.

Efficiently selecting optimal uni-modal models for multi-modal stitching
Reducing computational cost of connector module training
Leveraging hypernetworks for joint connector training across model combinations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hypernetwork Model Alignment (Hyma) for joint model selection and connector training
Jointly trained connector modules for all $N \times M$ uni-modal model combinations
Reduces optimal model-pair search cost by $10\times$ on average