🤖 AI Summary
Existing multilingual vision-language models (VLMs) are constrained by English-centric multimodal data and rely solely on instance-level cross-modal alignment, neglecting the global geometric structure of embedding spaces. To address this, the authors propose ToMCLIP, a framework that, for the first time, introduces persistent homology into multilingual VLMs. It employs a topology-aware alignment mechanism to explicitly model the global topological structure shared between visual and multilingual textual embeddings, and a graph sparsification strategy enables efficient approximation of topological features within a theoretically guaranteed error bound. Crucially, ToMCLIP enforces topological consistency explicitly in the shared embedding space, enhancing both the structural coherence and the robustness of semantic alignment. Experiments show improved zero-shot classification accuracy on CIFAR-100 and multilingual image–text retrieval on xFlickr&CO that surpasses state-of-the-art baselines.
📝 Abstract
Contrastive Vision-Language Models (VLMs) have demonstrated strong zero-shot capabilities. However, their cross-modal alignment remains biased toward English due to limited multilingual multimodal data. Recent multilingual extensions have alleviated this gap, but they enforce instance-level alignment while neglecting the global geometry of the shared embedding space. We address this problem by introducing ToMCLIP (Topological Alignment for Multilingual CLIP), a topology-aware framework that aligns embedding spaces under topology-preserving constraints. The proposed method applies persistent homology to define a topological alignment loss and approximates persistence diagrams, with theoretical error bounds, using a graph sparsification strategy. This work validates the proposed approach, showing enhanced structural coherence of multilingual representations, higher zero-shot accuracy on CIFAR-100, and stronger multilingual retrieval performance on xFlickr&CO. Beyond VLMs, the proposed approach provides a general method for incorporating topological alignment into representation learning.
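The abstract does not spell out the loss, so the following is only an illustrative sketch of the general idea, not the paper's implementation. It exploits a standard fact from topological data analysis: the finite death times in the 0-dimensional persistence diagram of a Vietoris–Rips filtration are exactly the edge lengths of the point cloud's minimum spanning tree. A toy "topological alignment loss" can then compare the sorted death times of the image and text embedding clouds. All function names here are hypothetical, and equal-sized embedding clouds are assumed.

```python
# Illustrative sketch only -- not ToMCLIP's actual loss or code.
# H0 death times of a Vietoris-Rips filtration equal the MST edge lengths,
# so 0-dimensional persistence can be computed without a TDA library.
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def h0_death_times(points: np.ndarray) -> np.ndarray:
    """Sorted H0 death times of the Rips filtration on `points`.

    Computed as the edge lengths of the minimum spanning tree of the
    pairwise-distance graph (n points -> n-1 death times).
    """
    dist = squareform(pdist(points))          # dense pairwise distances
    mst = minimum_spanning_tree(dist)         # sparse matrix of MST edges
    return np.sort(mst.data)

def topological_alignment_loss(img_emb: np.ndarray, txt_emb: np.ndarray) -> float:
    """L2 distance between the sorted H0 death times of two equal-sized
    embedding clouds -- a crude 1D proxy for matching persistence diagrams."""
    return float(np.linalg.norm(h0_death_times(img_emb) - h0_death_times(txt_emb)))
```

For identical clouds the loss is zero; as the text embeddings drift away from the image embeddings' connectivity structure, the sorted death times diverge and the loss grows. The actual method additionally sparsifies the distance graph before computing persistence, trading exactness for efficiency within a proven error bound.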