🤖 AI Summary
To address poor generalization in cross-lingual face-voice identity matching, caused by linguistic mismatch between training and test languages, this paper proposes a language-agnostic cross-modal joint embedding framework. It combines contrastive learning with cross-modal adapters to extract language-invariant speech representations and disentangled identity-specific facial features, thereby aligning voiceprints and faces in a unified semantic space. The paper also introduces FAME (Face-Audio Multilingual Evaluation), the first systematic cross-lingual voiceprint-face matching benchmark, establishing a new paradigm for language-agnostic biometric alignment. In the FAME 2026 Challenge, the proposed method achieves a 23.6% relative accuracy improvement over baselines under zero-shot cross-lingual transfer, significantly outperforming existing approaches and demonstrating strong generalization and potential for real-world deployment.
📝 Abstract
Over half of the world's population is bilingual, and people often communicate in multilingual settings. The Face-Voice Association in Multilingual Environments (FAME) 2026 Challenge, held at ICASSP 2026, focuses on developing face-voice association methods that remain effective when the test-time language differs from the training language. This report provides a brief summary of the challenge.