🤖 AI Summary
Existing multilingual vision-language models (VLMs) are constrained by English-centric multimodal data and rely solely on instance-level cross-modal alignment, neglecting the global geometric structure of embedding spaces. To address this, the authors propose ToMCLIP, a framework that, for the first time, introduces persistent homology into multilingual VLMs. It employs a topology-aware alignment mechanism to explicitly model the global topological structure shared between visual and multilingual textual embeddings, and a graph sparsification strategy enables efficient approximation of topological features within a theoretically guaranteed error bound. Crucially, ToMCLIP enforces topological consistency explicitly in the shared embedding space, enhancing both the structural coherence and the robustness of semantic alignment. Experiments show improved zero-shot classification accuracy on CIFAR-100 and multilingual image–text retrieval on xFlickr&CO that surpasses state-of-the-art baselines.
📝 Abstract
Contrastive Vision-Language Models (VLMs) have demonstrated strong zero-shot capabilities. However, their cross-modal alignment remains biased toward English due to limited multilingual multimodal data. Recent multilingual extensions have alleviated this gap, but they enforce instance-level alignment while neglecting the global geometry of the shared embedding space. We address this problem by introducing ToMCLIP (Topological Alignment for Multilingual CLIP), a topology-aware framework that aligns embedding spaces under topology-preserving constraints. The proposed method applies persistent homology to define a topological alignment loss and approximates persistence diagrams, with theoretical error bounds, using a graph sparsification strategy. This work validates the proposed approach, showing enhanced structural coherence of multilingual representations, higher zero-shot accuracy on CIFAR-100, and stronger multilingual retrieval performance on xFlickr&CO. Beyond VLMs, the proposed approach provides a general method for incorporating topological alignment into representation learning.
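The abstract does not spell out the loss, so the following is only an illustrative sketch of the general idea, not the paper's implementation. It exploits a standard fact from topological data analysis: the finite death times in the 0-dimensional persistence diagram of a Vietoris–Rips filtration are exactly the edge lengths of the point cloud's minimum spanning tree. A toy "topological alignment loss" can then compare the sorted death times of the image and text embedding clouds. All function names here are hypothetical, and equal-sized embedding clouds are assumed.

```python
# Illustrative sketch only -- not ToMCLIP's actual loss or code.
# H0 death times of a Vietoris-Rips filtration equal the MST edge lengths,
# so 0-dimensional persistence can be computed without a TDA library.
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def h0_death_times(points: np.ndarray) -> np.ndarray:
    """Sorted H0 death times of the Rips filtration on `points`.

    Computed as the edge lengths of the minimum spanning tree of the
    pairwise-distance graph (n points -> n-1 death times).
    """
    dist = squareform(pdist(points))          # dense pairwise distances
    mst = minimum_spanning_tree(dist)         # sparse matrix of MST edges
    return np.sort(mst.data)

def topological_alignment_loss(img_emb: np.ndarray, txt_emb: np.ndarray) -> float:
    """L2 distance between the sorted H0 death times of two equal-sized
    embedding clouds -- a crude 1D proxy for matching persistence diagrams."""
    return float(np.linalg.norm(h0_death_times(img_emb) - h0_death_times(txt_emb)))
```

For identical clouds the loss is zero; as the text embeddings drift away from the image embeddings' connectivity structure, the sorted death times diverge and the loss grows. The actual method additionally sparsifies the distance graph before computing persistence, trading exactness for efficiency within a proven error bound.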