🤖 AI Summary
This work addresses the weak zero-shot transfer between unpaired modalities in existing unified multimodal embedding models, which arises because supervision is available only for a small subset of modality pairs. The authors propose EmergentBridge, an embedding-level bridging framework that learns a mapping from an anchor embedding to a noisy bridge anchor (a proxy embedding of an already-aligned modality) and enforces proxy alignment only in the subspace orthogonal to the anchor-alignment direction. This strengthens connectivity between non-anchor modalities without requiring supervision for every modality pair. Restricting proxy alignment to the orthogonal subspace avoids the gradient interference that naive proxy alignment can introduce, so the existing anchor-alignment structure is preserved. Evaluated on nine datasets spanning multiple modalities, the framework consistently outperforms prior binding baselines on zero-shot classification and cross-modal retrieval.
📝 Abstract
Unified multimodal embedding spaces underpin practical applications such as cross-modal retrieval and zero-shot recognition. In many real deployments, however, supervision is available only for a small subset of modality pairs (e.g., image--text), leaving \emph{unpaired} modality pairs (e.g., audio$\leftrightarrow$depth, infrared$\leftrightarrow$audio) weakly connected and thus performing poorly on zero-shot transfer. Addressing this sparse-pairing regime is therefore essential for scaling unified embedding systems to new tasks without curating exhaustive pairwise data. We propose \textbf{EmergentBridge}, an embedding-level bridging framework that improves performance on these unpaired pairs \emph{without requiring exhaustive pairwise supervision}. Our key observation is that naively aligning a new modality to a synthesized proxy embedding can introduce \emph{gradient interference}, degrading the anchor-alignment structure that existing retrieval/classification relies on. EmergentBridge addresses this by (i) learning a mapping that produces a \emph{noisy bridge anchor} (a proxy embedding of an already-aligned modality) from an anchor embedding, and (ii) enforcing proxy alignment only in the subspace orthogonal to the anchor-alignment direction, preserving anchor alignment while strengthening non-anchor connectivity. Across nine datasets spanning multiple modalities, EmergentBridge consistently outperforms prior binding baselines on zero-shot classification and retrieval, demonstrating strong emergent alignment.
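To make step (ii) concrete, the following is a minimal NumPy sketch of one plausible reading of "proxy alignment only in the subspace orthogonal to the anchor-alignment direction". The abstract does not specify the losses, the bridge mapping, or how the direction is defined, so everything here is an assumption for illustration: the anchor-alignment direction is taken to be `anchor - z` (the descent direction of a squared-error surrogate), the proxy pull is `proxy - z`, and the names `project_orthogonal` and `orthogonal_proxy_update` as well as the linear map `W` are hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row of x to (approximately) unit L2 norm."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def project_orthogonal(v, direction, eps=1e-12):
    """Remove from each row of v its component along the matching row of direction."""
    d = l2_normalize(direction, eps)
    coeff = np.sum(v * d, axis=-1, keepdims=True)
    return v - coeff * d


def orthogonal_proxy_update(z, anchor, proxy):
    """Proxy-alignment update restricted to the subspace orthogonal to the anchor pull.

    z      : (n, d) embeddings of the new modality
    anchor : (n, d) paired anchor-modality embeddings
    proxy  : (n, d) noisy bridge anchors synthesized from `anchor`

    Assumption: squared-error surrogates, so the anchor-alignment direction
    is (anchor - z) and the unconstrained proxy pull is (proxy - z).
    """
    anchor_dir = anchor - z
    proxy_pull = proxy - z
    return project_orthogonal(proxy_pull, anchor_dir)


# Toy usage with random data. W is a hypothetical stand-in for the learned
# bridge mapping, and the added Gaussian noise only mimics an imperfect proxy.
rng = np.random.default_rng(0)
z = rng.normal(size=(4, 8))
anchor = rng.normal(size=(4, 8))
W = rng.normal(size=(8, 8)) / np.sqrt(8)
proxy = anchor @ W + 0.1 * rng.normal(size=(4, 8))

update = orthogonal_proxy_update(z, anchor, proxy)
# Row-wise inner product with the anchor direction; each entry is ~0.
overlap = np.sum(update * (anchor - z), axis=-1)
```

Under these assumptions, a step along `update` leaves the component of `z` along the anchor direction unchanged, which is the sense in which the anchor alignment is preserved while `z` is still pulled toward the proxy.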