EmergentBridge: Improving Zero-Shot Cross-Modal Transfer in Unified Multimodal Embedding Models

📅 2026-04-13
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the weak zero-shot transfer between unpaired modalities in unified multimodal embedding models, which arises because supervision is available for only a subset of modality pairs. The authors propose an embedding-level bridging framework that learns to map an anchor embedding to a noisy bridge anchor (a proxy embedding of an already-aligned modality) and aligns the new modality to this proxy only within the subspace orthogonal to the anchor-alignment direction. Restricting proxy alignment to that subspace avoids gradient interference, so the existing anchor alignment is preserved while connectivity between non-anchor modalities is strengthened, all without supervision for every modality pair. Across nine datasets spanning multiple modalities, the framework consistently outperforms prior binding baselines on zero-shot classification and cross-modal retrieval.
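To make the retrieval setting concrete, here is a minimal sketch of zero-shot cross-modal retrieval in a shared embedding space, where a query from one modality is ranked against items from another by cosine similarity. The vectors, the item names, and the `retrieve` helper are all hypothetical toy values chosen for illustration; this is not the paper's evaluation protocol or data.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def retrieve(query, gallery, k=1):
    """Return the names of the k gallery items most similar to the query."""
    ranked = sorted(gallery.items(),
                    key=lambda item: cosine(query, item[1]),
                    reverse=True)
    return [name for name, _ in ranked[:k]]

# Hypothetical embeddings: an audio query and a depth gallery that were
# never paired during training but live in the same space.
audio_query = [0.9, 0.1, 0.0]
depth_gallery = {
    "depth_kitchen": [1.0, 0.0, 0.1],
    "depth_street": [0.2, 1.0, 0.0],
    "depth_forest": [0.0, 0.0, 1.0],
}

top = retrieve(audio_query, depth_gallery, k=2)
print(top)
```

Retrieval quality for an unpaired pair such as audio and depth depends entirely on how well both modalities were placed in the shared space, which is the connectivity the summary above refers to.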

📝 Abstract
Unified multimodal embedding spaces underpin practical applications such as cross-modal retrieval and zero-shot recognition. In many real deployments, however, supervision is available only for a small subset of modality pairs (e.g., image-text), leaving unpaired modality pairs (e.g., audio↔depth, infrared↔audio) weakly connected and thus performing poorly on zero-shot transfer. Addressing this sparse-pairing regime is therefore essential for scaling unified embedding systems to new tasks without curating exhaustive pairwise data. We propose EmergentBridge, an embedding-level bridging framework that improves performance on these unpaired pairs without requiring exhaustive pairwise supervision. Our key observation is that naively aligning a new modality to a synthesized proxy embedding can introduce gradient interference, degrading the anchor-alignment structure that existing retrieval/classification relies on. EmergentBridge addresses this by (i) learning a mapping that produces a noisy bridge anchor (a proxy embedding of an already-aligned modality) from an anchor embedding, and (ii) enforcing proxy alignment only in the subspace orthogonal to the anchor-alignment direction, preserving anchor alignment while strengthening non-anchor connectivity. Across nine datasets spanning multiple modalities, EmergentBridge consistently outperforms prior binding baselines on zero-shot classification and retrieval, demonstrating strong emergent alignment.
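
The orthogonal-subspace idea in step (ii) can be illustrated with a small sketch: a proxy-alignment update is stripped of its component along the anchor-alignment direction, so applying it cannot move the embedding along that direction. The abstract does not specify how the direction or the losses are defined, so `anchor_direction`, `proxy_update`, and `project_orthogonal` are assumptions made for illustration, not the authors' implementation.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def project_orthogonal(update, direction, eps=1e-12):
    """Remove from `update` its component along `direction`.

    The result lies in the subspace orthogonal to `direction`.
    """
    norm_sq = dot(direction, direction)
    if norm_sq < eps:
        # Degenerate direction: nothing to project out.
        return list(update)
    scale = dot(update, direction) / norm_sq
    return [u - scale * d for u, d in zip(update, direction)]

# Hypothetical toy values (assumed, not from the paper):
# the direction along which the new modality is aligned to its anchor,
# and a raw update that pulls the embedding toward the proxy embedding.
anchor_direction = [1.0, 0.0, 0.0]
proxy_update = [0.5, 0.2, -0.1]

safe_update = project_orthogonal(proxy_update, anchor_direction)
print(safe_update)
print(dot(safe_update, anchor_direction))
```

Under these assumptions the projected update has zero inner product with the anchor-alignment direction, which is one simple way to read "preserving anchor alignment while strengthening non-anchor connectivity".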
Problem

Research questions and friction points this paper is trying to address.

zero-shot cross-modal transfer
unified multimodal embedding
sparse-pairing regime
unpaired modality pairs
emergent alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

EmergentBridge
zero-shot cross-modal transfer
unified multimodal embedding
sparse-pairing regime
orthogonal subspace alignment