AI Summary
This work addresses the challenging problem of cross-modal face-voice verification under unseen-language conditions. Methodologically, we propose the first foundation-model-based multilingual generalization framework: an ImageBind-LoRA dual-encoder architecture that integrates contrastive learning, an orthogonal projection loss, and LoRA-based low-rank fine-tuning, trained on our newly curated Arabic VoxBlink dataset, the first adaptation of ImageBind to cross-lingual audiovisual association. Empirically, the model trained exclusively on Arabic achieves an EER of 24.73% on the English and German test sets, substantially outperforming all baselines, and ranked second in the FAME2026 Challenge, demonstrating strong zero-shot cross-lingual generalization and practical deployability. Our approach establishes a new foundation for language-agnostic biometric authentication and advances the use of multimodal foundation models in low-resource linguistic settings.
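As a concrete illustration of the training objective named above, the sketch below pairs a symmetric InfoNCE-style contrastive loss with an orthogonal projection loss over identity labels. This is a minimal sketch, not the paper's exact implementation: the temperature `tau`, the loss weighting, and names such as `face_emb`, `voice_emb`, and `labels` are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(face_emb, voice_emb, tau=0.07):
    """Symmetric InfoNCE over a batch of paired embeddings.

    face_emb, voice_emb: (B, D) tensors; row i of each encodes
    the same identity, so the diagonal pairs are the positives.
    """
    face = F.normalize(face_emb, dim=-1)
    voice = F.normalize(voice_emb, dim=-1)
    logits = face @ voice.t() / tau  # (B, B) cosine-similarity logits
    targets = torch.arange(face.size(0), device=face.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def orthogonal_projection_loss(emb, labels):
    """Pull same-identity embeddings together and push
    different-identity embeddings toward orthogonality
    (zero cosine similarity)."""
    emb = F.normalize(emb, dim=-1)
    sim = emb @ emb.t()  # pairwise cosine similarities
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    eye = torch.eye(len(labels), dtype=torch.bool, device=emb.device)
    pos = sim[same & ~eye]  # off-diagonal same-identity pairs
    neg = sim[~same]        # cross-identity pairs
    pos_term = (1.0 - pos.mean()) if pos.numel() else sim.new_zeros(())
    return pos_term + neg.abs().mean()
```

Under these assumptions, a combined objective could be formed as `contrastive_loss(f, v) + lam * orthogonal_projection_loss(torch.cat([f, v]), labels.repeat(2))`, where `lam` is a hypothetical weighting hyperparameter not taken from the paper.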
Abstract
This paper describes the UZH-CL system submitted to the FAME2026 Challenge. The challenge focuses on cross-modal face-voice verification under unique multilingual conditions, specifically unseen and unheard languages. We investigate two distinct architectures: a baseline dual-encoder system trained from scratch with contrastive and orthogonal projection losses, and a foundation-model approach leveraging ImageBind with LoRA fine-tuning. To address the data scarcity and language constraints of the challenge, we curated an external Arabic dataset from VoxBlink. Our best-performing system, ImageBind-LoRA, demonstrates remarkable cross-lingual generalization: despite being fine-tuned exclusively on Arabic audio, it achieves an EER of 24.73% on the evaluation set (English and German), securing 2nd place in the competition.
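To make the LoRA fine-tuning step concrete, here is a minimal, hypothetical sketch of low-rank adaptation in plain PyTorch: base weights are frozen and only small rank-r adapters are trained. The target module names (`qkv`, `proj`), rank, and scaling are illustrative assumptions; the paper's exact adapter placement and hyperparameters are not reproduced here, and a library such as `peft` could be used instead.

```python
import math
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen nn.Linear with a trainable low-rank update:
    y = W x + (alpha / r) * B(A(x))."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)  # freeze pretrained weights
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.lora_a = nn.Linear(base.in_features, r, bias=False)
        self.lora_b = nn.Linear(r, base.out_features, bias=False)
        nn.init.kaiming_uniform_(self.lora_a.weight, a=math.sqrt(5))
        nn.init.zeros_(self.lora_b.weight)  # update starts at zero
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

def add_lora(model: nn.Module, target=("qkv", "proj"), r=8, alpha=16):
    """Recursively replace matching nn.Linear submodules with
    LoRA-wrapped versions; all other parameters stay frozen."""
    for name, module in model.named_children():
        if isinstance(module, nn.Linear) and any(t in name for t in target):
            setattr(model, name, LoRALinear(module, r=r, alpha=alpha))
        else:
            add_lora(module, target, r, alpha)
    return model
```

In this sketch, only the `lora_a`/`lora_b` parameters receive gradients, so the optimizer would be built over `[p for p in model.parameters() if p.requires_grad]`, keeping the frozen encoder intact.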