AI Summary
This work addresses the challenging problem of cross-modal face-voice verification under unseen-language conditions. Methodologically, we propose the first foundation-model-based multilingual generalization framework: an ImageBind-LoRA dual-encoder architecture that integrates contrastive learning, an orthogonal projection loss, and LoRA-based low-rank fine-tuning, trained on our newly curated Arabic VoxBlink dataset, the first adaptation of ImageBind to cross-lingual audiovisual association. Empirically, the model trained exclusively on Arabic achieves an EER of 24.73% on the English and German test sets, substantially outperforming all baselines, and ranked second in the FAME2026 Challenge, demonstrating strong zero-shot cross-lingual generalization and practical deployability. Our approach establishes a new foundation for language-agnostic biometric authentication and advances the use of multimodal foundation models in low-resource linguistic settings.
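As a concrete illustration of the training objective named above, the sketch below pairs a symmetric InfoNCE-style contrastive loss with an orthogonal projection loss over identity labels. This is a minimal sketch, not the paper's exact implementation: the temperature `tau`, the loss weighting, and names such as `face_emb`, `voice_emb`, and `labels` are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(face_emb, voice_emb, tau=0.07):
    """Symmetric InfoNCE over a batch of paired embeddings.

    face_emb, voice_emb: (B, D) tensors; row i of each encodes
    the same identity, so the diagonal pairs are the positives.
    """
    face = F.normalize(face_emb, dim=-1)
    voice = F.normalize(voice_emb, dim=-1)
    logits = face @ voice.t() / tau  # (B, B) cosine-similarity logits
    targets = torch.arange(face.size(0), device=face.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def orthogonal_projection_loss(emb, labels):
    """Pull same-identity embeddings together and push
    different-identity embeddings toward orthogonality
    (zero cosine similarity)."""
    emb = F.normalize(emb, dim=-1)
    sim = emb @ emb.t()  # pairwise cosine similarities
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    eye = torch.eye(len(labels), dtype=torch.bool, device=emb.device)
    pos = sim[same & ~eye]  # off-diagonal same-identity pairs
    neg = sim[~same]        # cross-identity pairs
    pos_term = (1.0 - pos.mean()) if pos.numel() else sim.new_zeros(())
    return pos_term + neg.abs().mean()
```

Under these assumptions, a combined objective could be formed as `contrastive_loss(f, v) + lam * orthogonal_projection_loss(torch.cat([f, v]), labels.repeat(2))`, where `lam` is a hypothetical weighting hyperparameter not taken from the paper.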
Abstract
This paper describes the UZH-CL system submitted to the FAME2026 Challenge. The challenge focuses on cross-modal face-voice verification under unique multilingual conditions, specifically unseen and unheard languages. We investigate two distinct architectures: a baseline dual-encoder system trained from scratch with contrastive and orthogonal projection losses, and a foundation-model approach leveraging ImageBind with LoRA fine-tuning. To address the data scarcity and language constraints of the challenge, we curated an external Arabic dataset from VoxBlink. Our best-performing system, ImageBind-LoRA, demonstrates remarkable cross-lingual generalization: despite being fine-tuned exclusively on Arabic audio, it achieves an EER of 24.73% on the evaluation set (English and German), securing 2nd place in the competition.
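To make the LoRA fine-tuning step concrete, here is a minimal, hypothetical sketch of low-rank adaptation in plain PyTorch: base weights are frozen and only small rank-r adapters are trained. The target module names (`qkv`, `proj`), rank, and scaling are illustrative assumptions; the paper's exact adapter placement and hyperparameters are not reproduced here, and a library such as `peft` could be used instead.

```python
import math
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen nn.Linear with a trainable low-rank update:
    y = W x + (alpha / r) * B(A(x))."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)  # freeze pretrained weights
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.lora_a = nn.Linear(base.in_features, r, bias=False)
        self.lora_b = nn.Linear(r, base.out_features, bias=False)
        nn.init.kaiming_uniform_(self.lora_a.weight, a=math.sqrt(5))
        nn.init.zeros_(self.lora_b.weight)  # update starts at zero
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

def add_lora(model: nn.Module, target=("qkv", "proj"), r=8, alpha=16):
    """Recursively replace matching nn.Linear submodules with
    LoRA-wrapped versions; all other parameters stay frozen."""
    for name, module in model.named_children():
        if isinstance(module, nn.Linear) and any(t in name for t in target):
            setattr(model, name, LoRALinear(module, r=r, alpha=alpha))
        else:
            add_lora(module, target, r, alpha)
    return model
```

In this sketch, only the `lora_a`/`lora_b` parameters receive gradients, so the optimizer would be built over `[p for p in model.parameters() if p.requires_grad]`, keeping the frozen encoder intact.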