Unified Vision-Language Modeling via Concept Space Alignment

📅 2026-03-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of achieving efficient cross-lingual and cross-modal (vision–language) alignment within a unified semantic space, with a particular focus on multilingual and low-resource settings. Using a post-hoc alignment approach, the authors map an existing vision encoder into the multilingual text embedding space SONAR, thereby constructing a unified vision–language embedding space termed V-SONAR. Paired with the OMNISONAR text decoder, V-SONAR significantly outperforms state-of-the-art methods in BLEU on the DREAM-1K and PE-VIDEO video captioning benchmarks. Building upon this foundation, the authors introduce the Vision Large Concept Model (V-LCM), which, for the first time, enables zero-shot understanding of visual concepts across 62 languages using only English training data. V-LCM unifies textual and visual inputs into latent embedding sequences, supporting multilingual cross-modal instruction following and generation, and achieves substantial gains in 61 of the 62 evaluated languages.

📝 Abstract
We introduce V-SONAR, a vision-language embedding space extended from the text-only embedding space SONAR (Omnilingual Embeddings Team et al., 2026), which supports 1500 text languages and 177 speech languages. To construct V-SONAR, we propose a post-hoc alignment pipeline that maps the representations of an existing vision encoder into the SONAR space. We thoroughly evaluate V-SONAR and show that its embeddings achieve competitive performance on text-to-video retrieval. Equipped with the OMNISONAR text decoder, V-SONAR further surpasses state-of-the-art vision-language models on video captioning tasks, including DREAM-1K (BLEU 23.9 vs. 19.6) and PE-VIDEO (BLEU 39.0 vs. 30.0). Leveraging V-SONAR, we first demonstrate that the Large Concept Model (LCM; LCM Team et al., 2024), operating in SONAR and trained with English text only, can perform both single- and multi-visual concept understanding in a zero-shot manner. Finally, we introduce V-LCM, which extends the LCM with vision-language instruction tuning. V-LCM encodes vision and language inputs into a unified sequence of latent embeddings via V-SONAR and SONAR, and it is trained with the same latent diffusion objective for next-embedding prediction as in LCM's text-only pre-training. Experiments on a large-scale multilingual and multimodal instruction-tuning data mixture highlight the potential of V-LCM: V-LCM matches state-of-the-art vision-language models on tasks covering image/video captioning and question answering, while significantly outperforming them on 61 of the 62 tested languages, spanning high- to low-resource settings.
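The latent diffusion objective for next-embedding prediction can be sketched as follows: corrupt the next concept embedding with Gaussian noise at a sampled noise level, condition a denoiser on the prefix of earlier embeddings, and train it to recover the injected noise (epsilon prediction). The linear "denoiser", noise schedule, and dimensions below are hypothetical placeholders for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
d, seq_len = 32, 8

# A toy "concept sequence": each row is one sentence/frame-level embedding.
seq = rng.normal(size=(seq_len, d))
prefix, target = seq[:-1], seq[-1]      # condition on prefix, predict the next one

# Diffusion-style corruption of the target at a randomly sampled noise level.
t = rng.uniform(0.1, 0.9)               # noise level in (0, 1)
eps = rng.normal(size=d)                # the true injected noise
noisy_target = np.sqrt(1 - t) * target + np.sqrt(t) * eps

# Hypothetical denoiser: a fixed linear map over (prefix summary, noisy target);
# a real model would be a learned network conditioned on the full prefix.
W = rng.normal(size=(2 * d, d)) / np.sqrt(2 * d)
features = np.concatenate([prefix.mean(axis=0), noisy_target])
eps_pred = features @ W

# Training objective: MSE between predicted and true noise (epsilon prediction).
loss = np.mean((eps_pred - eps) ** 2)
print(f"denoising loss: {loss:.3f}")
```

Because the target of prediction is an embedding rather than a token, the same objective applies unchanged whether the next "concept" came from text (SONAR) or vision (V-SONAR).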
Problem

Research questions and friction points this paper is trying to address.

vision-language modeling
multilingual
zero-shot learning
concept understanding
cross-modal alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

vision-language alignment
post-hoc embedding alignment
multilingual multimodal modeling
latent diffusion objective
zero-shot cross-lingual vision understanding
🔎 Similar Papers