🤖 AI Summary
To address the prohibitively high computational cost of jointly optimizing unimodal models and modality-specific connectors in multimodal foundation model construction, this paper proposes Hypernetwork Model Alignment (Hyma). Hyma is the first framework to leverage hypernetworks for cross-modal alignment, unifying model selection and connector training: by exploiting the parameter-prediction capability of hypernetworks, it aligns the representation spaces of all $N \times M$ unimodal combinations within a single training pass—eliminating the need for repetitive grid-search-based training. Evaluated on multiple multimodal benchmarks, Hyma achieves performance on par with exhaustive grid search while reducing the cost of identifying optimal modality pairings by an order of magnitude. This significantly enhances both the efficiency and scalability of multimodal model development.
📝 Abstract
Foundation multi-modal models are often designed by stitching together multiple existing pretrained uni-modal models: for example, an image classifier with an autoregressive text model. This stitching is performed by training a connector module that aims to align the representation-representation or representation-input spaces of these uni-modal models. However, given the complexity of training such connectors on large-scale web-based datasets, coupled with the ever-increasing number of available pretrained uni-modal models, the task of uni-modal model selection and subsequent connector module training becomes computationally demanding. To address this under-studied critical problem, we propose Hypernetwork Model Alignment (Hyma), a novel all-in-one solution for optimal uni-modal model selection and connector training that leverages hypernetworks. Specifically, our framework utilizes the parameter prediction capability of a hypernetwork to obtain jointly trained connector modules for $N \times M$ combinations of uni-modal models. In our experiments, Hyma reduces the optimal uni-modal model pair search cost by $10\times$ (averaged across all experiments), while matching the ranking and trained connector performance obtained via grid search across a suite of diverse multi-modal benchmarks.
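The core idea — a single hypernetwork predicting connector weights for every uni-modal model pair — can be illustrated with a minimal NumPy sketch. All names, dimensions, and the MLP architecture below are illustrative assumptions, not the paper's actual implementation; the point is only that one set of hypernetwork parameters serves all $N \times M$ pairs, so training the hypernetwork once replaces training $N \times M$ separate connectors:

```python
import numpy as np

rng = np.random.default_rng(0)

N, M = 3, 4            # hypothetical counts of vision / text backbones
d_img, d_txt = 16, 8   # illustrative representation dimensions
d_emb, d_hid = 10, 32  # pair-embedding and hypernetwork hidden sizes

# One learnable embedding identifying each (vision model, text model) pair.
pair_emb = rng.normal(size=(N * M, d_emb))

# Hypernetwork: a small MLP mapping a pair embedding to the flattened
# weights of a linear connector (d_img -> d_txt). These two matrices are
# the only trained parameters shared across all N*M pairs.
W1 = 0.1 * rng.normal(size=(d_emb, d_hid))
W2 = 0.1 * rng.normal(size=(d_hid, d_img * d_txt))

def predict_connector(pair_id: int) -> np.ndarray:
    """Predict the connector weight matrix for one uni-modal model pair."""
    h = np.tanh(pair_emb[pair_id] @ W1)
    return (h @ W2).reshape(d_img, d_txt)

# The same hypernetwork yields a connector for every pair in one pass.
img_feat = rng.normal(size=(1, d_img))  # dummy image representation
for pid in range(N * M):
    txt_aligned = img_feat @ predict_connector(pid)
    assert txt_aligned.shape == (1, d_txt)
```

In the actual method the hypernetwork and pair embeddings would be optimized jointly with an alignment loss over web-scale data; the sketch omits training entirely and only shows the parameter-prediction mechanism.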