🤖 AI Summary
In real-world distributed multimodal scenarios, modalities are often non-redundant, yet conventional alignment methods rely on the assumption of a single global shared latent space, leading to poor robustness to missing modalities and weak zero-shot cross-modal generalization. To address this, we propose SheafAlign: a sheaf-theoretic framework that constructs local comparison spaces between modality pairs, explicitly modeling both shared and modality-specific information while abandoning the global redundancy assumption. It achieves efficient local alignment via decentralized contrastive learning. Theoretically grounded and practically scalable, SheafAlign reduces communication overhead by 50% relative to state-of-the-art (SOTA) methods. Empirically, it establishes new SOTA performance in both robustness to missing modalities and zero-shot cross-modal generalization. Its design balances theoretical rigor, rooted in algebraic topology, with engineering efficiency for distributed deployment.
📝 Abstract
Conventional multimodal alignment methods assume mutual redundancy across all modalities, an assumption that fails in real-world distributed scenarios. We propose SheafAlign, a sheaf-theoretic framework for decentralized multimodal alignment that replaces single-space alignment with multiple comparison spaces. This approach models pairwise modality relations through sheaf structures and is trained with decentralized contrastive objectives. SheafAlign overcomes the limitations of prior methods by not requiring mutual redundancy among all modalities, preserving both shared and modality-specific information. Experiments on multimodal sensing datasets show superior zero-shot generalization, cross-modal alignment, and robustness to missing modalities, with 50% lower communication cost than state-of-the-art baselines.
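To make the pairwise-comparison-space idea more concrete, here is a minimal PyTorch sketch of how per-pair restriction maps and decentralized contrastive objectives could be wired together. It is an illustration under our own assumptions, not the paper's implementation: the module name `PairwiseSheafAlign`, the linear restriction maps, the comparison-space dimension, and the symmetric InfoNCE loss are placeholders chosen for clarity.

```python
# Illustrative sketch (not the authors' code): each modality pair (a, b) gets its own
# comparison space, reached via per-modality "restriction" maps, and is aligned with a
# contrastive loss computed only over the modalities present in the batch.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PairwiseSheafAlign(nn.Module):
    def __init__(self, modality_dims, pair_dim=64):
        super().__init__()
        self.modalities = list(modality_dims)
        # One restriction map per (modality, pair): projects a modality's embedding
        # into the comparison space shared only by that pair.
        self.restriction = nn.ModuleDict()
        for i, a in enumerate(self.modalities):
            for b in self.modalities[i + 1:]:
                self.restriction[f"{a}_to_{a}_{b}"] = nn.Linear(modality_dims[a], pair_dim)
                self.restriction[f"{b}_to_{a}_{b}"] = nn.Linear(modality_dims[b], pair_dim)

    @staticmethod
    def pair_loss(za, zb, temperature=0.07):
        # Symmetric InfoNCE over a batch of paired samples in one comparison space.
        za = F.normalize(za, dim=-1)
        zb = F.normalize(zb, dim=-1)
        logits = za @ zb.t() / temperature
        labels = torch.arange(za.size(0), device=za.device)
        return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

    def forward(self, embeddings):
        # embeddings: dict modality -> (batch, dim) features. A missing modality simply
        # drops the pairs it participates in; no single global shared space is required.
        total, n_pairs = 0.0, 0
        present = [m for m in self.modalities if m in embeddings]
        for i, a in enumerate(present):
            for b in present[i + 1:]:
                za = self.restriction[f"{a}_to_{a}_{b}"](embeddings[a])
                zb = self.restriction[f"{b}_to_{a}_{b}"](embeddings[b])
                total = total + self.pair_loss(za, zb)
                n_pairs += 1
        return total / max(n_pairs, 1)


if __name__ == "__main__":
    # Hypothetical modalities and dimensions, purely for demonstration.
    model = PairwiseSheafAlign({"radar": 128, "camera": 256, "imu": 32})
    batch = {"radar": torch.randn(16, 128), "camera": torch.randn(16, 256)}  # "imu" missing
    print(model(batch))  # contrastive loss over the radar-camera pair only
```

In a decentralized setting, one would expect each node to hold only its own modality encoder and the restriction maps for pairs it participates in, exchanging just the low-dimensional pairwise projections; the single-process version above is only meant to show the per-pair structure of the objective.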