🤖 AI Summary
This work addresses two obstacles to training multimodal large language models: the scarcity of high-quality paired data, and the unreliability of existing methods for interchanging representations across modalities, which stems from an incomplete understanding of the geometry of the modality gap. The study finds that representations from different modalities already share a compatible dominant semantic geometry; the bottleneck for interchangeability is an anisotropic residual concentrated along a small number of dominant directions, not a simple global shift. Building on this, the authors propose AnisoAlign, a framework that treats the modality gap as a structured geometric discrepancy that can be corrected. Using the internal geometric prior of the target modality, AnisoAlign applies a bounded anisotropic correction to source-modality representations, producing substitute representations in the target modality that preserve the source's semantic structure, without requiring paired data. Experiments confirm its benefits in geometric diagnostics and in text-only training of multimodal models.
📝 Abstract
Training multimodal large language models (MLLMs) has long been limited by the scarcity of high-quality paired multimodal data. Recent studies show that the shared representation space of pretrained multimodal contrastive models can serve as a bridge, enabling multimodal training with unimodal data. However, the key premise of this paradigm remains insufficiently understood: can representations from different modalities be reliably interchanged? The core obstacle is the persistent modality gap in the shared space. In this work, we revisit the geometric nature of the modality gap. We find that modality representations already share a compatible dominant semantic geometry. What truly hinders modality interchangeability is not a simple global shift, but an anisotropic residual structure concentrated along a small number of dominant directions. Based on this finding, we propose the principle of anisotropic modality gap alignment: effective modality alignment should match the target-modality distribution while preserving the semantic structure of the source modality. Guided by this principle, we propose AnisoAlign, an anisotropic geometric correction framework for unpaired modality alignment. The framework leverages the internal geometric prior of the target modality and performs a bounded correction on source-modality representations, thereby constructing substitute representations in the target modality. Experiments confirm its benefits in both geometric diagnostics and text-only MLLM training. Overall, this work recasts the modality gap from an empirical observation into a correctable, structured geometric phenomenon and provides a new representation-alignment perspective for training multimodal models with unimodal data.
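To make the geometric claim concrete, the following is a minimal NumPy sketch of one way such a diagnostic and a bounded correction could look. It is not the authors' AnisoAlign implementation, which the abstract does not specify. The function names (`make_toy_embeddings`, `residual_directions`, `bounded_correction`), the use of a covariance difference to find residual directions, the clipping parameter `max_scale`, and the synthetic data are all assumptions introduced for illustration.

```python
import numpy as np


def make_toy_embeddings(n=5000, d=16, seed=0):
    """Hypothetical toy data (an assumption, not real model embeddings).

    The 'target' cloud follows the same distribution as the 'source' cloud,
    except for a global mean shift and a 3x stretch along the first two axes.
    The two sets are sampled independently, so they are unpaired.
    """
    rng = np.random.default_rng(seed)
    src = rng.standard_normal((n, d))
    tgt = rng.standard_normal((n, d))
    tgt[:, :2] *= 3.0  # anisotropic residual confined to two directions
    tgt += 0.5         # simple global shift
    return src, tgt


def residual_directions(src, tgt, k):
    """Find the k directions where the two covariances differ most.

    Returns the directions (d x k) and the fraction of the total absolute
    eigenvalue mass of the covariance difference that they carry.
    """
    diff = np.cov(tgt, rowvar=False) - np.cov(src, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(diff)  # diff is symmetric
    order = np.argsort(np.abs(eigvals))[::-1]
    top = order[:k]
    energy = float(np.abs(eigvals[top]).sum() / np.abs(eigvals).sum())
    return eigvecs[:, top], energy


def bounded_correction(src, tgt, k=2, max_scale=4.0):
    """Move src toward the target distribution using unpaired statistics only.

    Step 1: replace the source mean with the target mean.
    Step 2: rescale src along the k residual directions so its spread
            matches the target's, with the scale clipped to
            [1 / max_scale, max_scale] so the correction stays bounded.
    Components orthogonal to those k directions are left unchanged.
    """
    mu_s, mu_t = src.mean(axis=0), tgt.mean(axis=0)
    dirs, _ = residual_directions(src, tgt, k)
    proj_s = (src - mu_s) @ dirs
    proj_t = (tgt - mu_t) @ dirs
    scale = proj_t.std(axis=0) / (proj_s.std(axis=0) + 1e-12)
    scale = np.clip(scale, 1.0 / max_scale, max_scale)
    corrected = (src - mu_s) + (proj_s * (scale - 1.0)) @ dirs.T
    return corrected + mu_t
```

On the toy data, `residual_directions(src, tgt, k=2)` should report that most of the covariance mismatch sits in two directions, and `bounded_correction(src, tgt)` should shrink that mismatch while leaving the remaining directions of the source untouched. Real contrastive embeddings are typically unit-normalized and much higher-dimensional, so the choice of `k` and `max_scale` would need to be validated on actual data.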