Towards Language-Independent Face-Voice Association with Multimodal Foundation Models

πŸ“… 2025-12-02
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses the challenging cross-modal face-voice verification problem under unseen-language conditions. Methodologically, it proposes a foundation-model-based multilingual generalization framework: an ImageBind-LoRA dual-encoder architecture combining contrastive learning, an orthogonal projection loss, and LoRA low-rank fine-tuning, trained on a newly curated Arabic VoxBlink dataset; the authors describe this as the first adaptation of ImageBind to cross-lingual audiovisual association. Empirically, the model trained exclusively on Arabic achieves an EER of 24.73% on English and German test sets, substantially outperforming all baselines, and ranked second in the FAME2026 Challenge, demonstrating strong zero-shot cross-lingual generalization and practical deployability. The approach supports language-agnostic biometric authentication and advances the use of multimodal foundation models in low-resource linguistic settings.
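The LoRA low-rank fine-tuning mentioned in the summary can be illustrated in isolation. The sketch below is a minimal NumPy illustration with hypothetical toy dimensions (`d_in`, `d_out`, `r`, `alpha` are not the paper's actual configuration): a frozen pretrained weight `W` is adapted through a trainable low-rank product `B @ A` scaled by `alpha / r`, so only `r * (d_in + d_out)` parameters are updated.

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_out, r, alpha = 64, 64, 8, 16  # toy dimensions; rank r << d

W = rng.normal(size=(d_out, d_in))          # frozen pretrained weight
A = rng.normal(scale=0.01, size=(r, d_in))  # trainable, small random init
B = np.zeros((d_out, r))                    # trainable, zero init

def lora_forward(x):
    # Frozen base path plus low-rank update; because B is zero at
    # initialization, the adapted layer starts identical to the base model.
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)
```

Zero-initializing `B` is the standard LoRA trick: fine-tuning starts from the pretrained behavior and only gradually injects the low-rank correction.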

πŸ“ Abstract
This paper describes the UZH-CL system submitted to the FAME2026 Challenge. The challenge focuses on cross-modal face-voice verification under multilingual conditions, specifically unseen and unheard languages. We investigate two distinct architectures: a baseline dual-encoder system trained from scratch with contrastive and orthogonal projection losses, and a foundation-model approach that fine-tunes ImageBind with LoRA. To address the data scarcity and language constraints of the challenge, we curated an external Arabic dataset from VoxBlink. Our best-performing system, ImageBind-LoRA, demonstrates remarkable cross-lingual generalization: despite being fine-tuned exclusively on Arabic audio, it achieved an EER of 24.73% on the English and German evaluation set, securing 2nd place in the competition.
Problem

Research questions and friction points this paper is trying to address.

Cross-modal face-voice verification under unseen multilingual conditions
Addressing data scarcity by curating an external Arabic dataset
Enhancing cross-lingual generalization with the ImageBind-LoRA architecture
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leveraged ImageBind with LoRA for cross-modal association
Used contrastive and orthogonal projection losses in training
Curated external Arabic dataset to overcome data scarcity
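The two training losses named above can be sketched as follows. This is a minimal NumPy illustration assuming a symmetric InfoNCE-style contrastive term and an orthogonality term that pulls matched face-voice pairs toward cosine similarity 1 while pushing mismatched identities toward cosine 0; the paper's exact formulations and hyperparameters may differ.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def contrastive_loss(face, voice, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired face/voice embeddings."""
    sim = l2_normalize(face) @ l2_normalize(voice).T / temperature
    n = sim.shape[0]

    def xent(logits):
        # Cross-entropy with the matching pair (the diagonal) as the target.
        logits = logits - logits.max(axis=1, keepdims=True)
        log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        return -log_prob[np.arange(n), np.arange(n)].mean()

    # Average the face-to-voice and voice-to-face directions.
    return 0.5 * (xent(sim) + xent(sim.T))

def orthogonal_projection_loss(face, voice, labels):
    """Same identity -> cosine near 1; different identity -> cosine near 0."""
    cos = l2_normalize(face) @ l2_normalize(voice).T
    same = labels[:, None] == labels[None, :]
    return (1.0 - cos[same].mean()) + np.abs(cos[~same]).mean()
```

The contrastive term organizes the joint embedding space batch-wise, while the orthogonality term explicitly decorrelates different identities, which is the property the verification EER ultimately measures.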
πŸ”Ž Similar Papers
No similar papers found.