Shared Multi-modal Embedding Space for Face-Voice Association

📅 2025-12-04
🤖 AI Summary
This paper addresses cross-modal matching of faces and voices in a multilingual setting, including evaluation on languages that were not seen during training. The method uses separate unimodal face and voice feature extractors, adds age-gender features to support cross-modal alignment, and projects the resulting features into a shared embedding space trained with an Adaptive Angular Margin (AAM) loss. The approach placed first in the FAME 2026 Challenge with an average equal error rate (EER) of 23.99%.

📝 Abstract
The FAME 2026 challenge combines two demanding tasks: learning face-voice associations, and doing so in a multilingual setting in which models are also tested on languages they were not trained on. Our approach consists of separate uni-modal processing pipelines for general face and voice feature extraction, complemented by additional age-gender feature extraction to support prediction. The resulting uni-modal features are projected into a shared embedding space and trained with an Adaptive Angular Margin (AAM) loss. Our approach achieved first place in the FAME 2026 challenge, with an average Equal Error Rate (EER) of 23.99%.
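For readers unfamiliar with the metric, the EER is the operating point at which the false acceptance rate equals the false rejection rate. The following is a minimal sketch of how it can be estimated from pair scores. It is not the challenge's official scoring script, and the function name `equal_error_rate` and the threshold sweep over observed scores are illustrative assumptions.

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Estimate the EER from similarity scores.

    labels: 1 for a matching face-voice pair, 0 for a non-matching pair.
    Higher scores are assumed to mean "more likely a match".
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    thresholds = np.unique(scores)
    # False acceptance: a non-matching pair scores at or above the threshold.
    far = np.array([(neg >= t).mean() for t in thresholds])
    # False rejection: a matching pair scores below the threshold.
    frr = np.array([(pos < t).mean() for t in thresholds])
    # Pick the threshold where the two error rates are closest.
    i = np.argmin(np.abs(far - frr))
    return (far[i] + frr[i]) / 2.0
```

An EER of 23.99% therefore means that, at the balanced threshold, roughly one in four matching pairs is rejected and roughly one in four non-matching pairs is accepted.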
Problem

Research questions and friction points this paper is trying to address.

Develops a shared embedding space for face-voice association
Addresses multilingual testing on languages unseen during training
Uses adaptive angular margin loss to improve feature discrimination
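The idea of a shared embedding space can be sketched as follows: each modality gets its own projection, and a face and a voice are compared by cosine similarity after projection. This is an illustration under stated assumptions only. The feature dimensions, the single linear layer per modality, and the random weights are hypothetical placeholders, not details taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: the source does not state the real dimensions.
FACE_DIM, VOICE_DIM, SHARED_DIM = 512, 192, 128

# Stand-ins for learned projection weights (random here, trained in practice).
W_face = rng.normal(scale=0.02, size=(FACE_DIM, SHARED_DIM))
W_voice = rng.normal(scale=0.02, size=(VOICE_DIM, SHARED_DIM))

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosines."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)

def project(features, weight):
    """Map uni-modal features into the shared space and normalize them."""
    return l2_normalize(features @ weight)

def match_scores(face_feats, voice_feats):
    """Cosine similarity between every face and every voice."""
    return project(face_feats, W_face) @ project(voice_feats, W_voice).T
```

Because both modalities end up in the same normalized space, a single similarity threshold can decide whether a face and a voice belong to the same person, regardless of the language spoken.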
Innovation

Methods, ideas, or system contributions that make the work stand out.

Separate uni-modal pipelines for face and voice processing
Projection into shared embedding space with AAM loss
Additional age-gender feature extraction to support prediction
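The contributions above can be illustrated with a small sketch of an angular margin objective over identity classes. The source names the loss "Adaptive Angular Margin" but does not describe how the margin adapts, so the code below substitutes a fixed additive angular margin in the style of ArcFace. The concatenation in `fuse`, the `margin` and `scale` values, and all function names are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def fuse(identity_feats, age_gender_feats):
    """Assumed fusion: concatenate identity and age-gender features."""
    return np.concatenate([identity_feats, age_gender_feats], axis=1)

def angular_margin_loss(embeddings, class_weights, labels, margin=0.2, scale=30.0):
    """Softmax cross-entropy with a fixed angular margin on the target class.

    Simplification: no special handling for angles where theta + margin > pi.
    """
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    w = class_weights / np.linalg.norm(class_weights, axis=1, keepdims=True)
    cos = np.clip(emb @ w.T, -1.0, 1.0)
    rows = np.arange(len(labels))
    logits = cos.copy()
    # Penalize the target class by widening its angle before taking the cosine.
    theta_target = np.arccos(cos[rows, labels])
    logits[rows, labels] = np.cos(theta_target + margin)
    logits *= scale
    # Numerically stable log-softmax.
    logits -= logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[rows, labels].mean()
```

The margin forces an embedding to sit closer to its own identity's class direction than plain softmax would require. Applying the same identity classes to both face and voice embeddings is one way such a loss can pull the two modalities together.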