Shared Multi-modal Embedding Space for Face-Voice Association

📅 2025-12-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This paper addresses face-voice association in a multilingual setting, where the model is also tested on languages it was not trained on. The method uses separate unimodal feature extractors for face and voice, adds age-gender features to support prediction, and projects the resulting features into a shared embedding space trained with an Adaptive Angular Margin (AAM) loss. The approach took first place in the FAME 2026 challenge with an average Equal-Error Rate (EER) of 23.99%.

📝 Abstract

The FAME 2026 challenge combines two demanding tasks: learning face-voice associations, and doing so in a multilingual setting in which the model is also tested on languages it was not trained on. Our approach consists of separate uni-modal processing pipelines with general face and voice feature extraction, complemented by additional age-gender feature extraction to support prediction. The resulting uni-modal features are projected into a shared embedding space and trained with an Adaptive Angular Margin (AAM) loss. Our approach achieved first place in the FAME 2026 challenge, with an average Equal-Error Rate (EER) of 23.99%.
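
For readers unfamiliar with the metric, the EER is the operating point at which the false acceptance rate equals the false rejection rate. A minimal sketch of how it could be computed from face-voice pair scores is given here. The function name, the simple threshold sweep, and the toy scores are illustrative assumptions, not the challenge's official scoring code.

```python
def equal_error_rate(scores, labels):
    """Approximate the EER from match scores (higher = more likely the same identity).

    labels: 1 for a matching face-voice pair, 0 for a non-matching pair.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need both matching and non-matching pairs")

    best_gap, eer = float("inf"), 1.0
    for threshold in sorted(set(scores)):
        # A pair is accepted when its score reaches the threshold.
        false_accepts = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        false_rejects = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
        far = false_accepts / negatives
        frr = false_rejects / positives
        # Keep the threshold where the two error rates are closest.
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer


# Toy data (hypothetical): four matching and four non-matching pairs.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 1, 0, 0]
print(equal_error_rate(scores, labels))  # 0.25
```

A lower EER is better: 0% would mean matching and non-matching pairs are perfectly separated, while 50% corresponds to chance for a balanced set of pairs.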

Problem

Research questions and friction points this paper is trying to address.

Develops a shared embedding space for face-voice association

Addresses multilingual testing on languages unseen during training

Uses adaptive angular margin loss to improve feature discrimination

Innovation

Methods, ideas, or system contributions that make the work stand out.

Separate uni-modal pipelines for face and voice processing

Projection into shared embedding space with AAM loss

Additional age-gender feature extraction to support prediction
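
The three points above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the linear projections, the concatenation of age-gender features with the face and voice features, the identity prototypes, and the margin and scale values are all hypothetical. In particular, the sketch uses the standard fixed angular-margin form of the loss; how the paper adapts the margin is not described here.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def project(features, weights):
    """Linear map into the shared space (one weight row per output dim), then L2-normalise."""
    return l2_normalize([sum(w * x for w, x in zip(row, features)) for row in weights])


def angular_margin_loss(embedding, prototypes, target, margin=0.2, scale=30.0):
    """Softmax cross-entropy on scaled cosines, with the target angle widened by `margin`.

    Assumption: fixed margin (ArcFace-style), standing in for the paper's AAM loss.
    """
    logits = []
    for idx, proto in enumerate(prototypes):
        cos = sum(a * b for a, b in zip(embedding, l2_normalize(proto)))
        theta = math.acos(max(-1.0, min(1.0, cos)))
        if idx == target:
            # Penalise the true identity so it must be matched with room to spare.
            theta = min(math.pi, theta + margin)
        logits.append(scale * math.cos(theta))
    peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[target]


rng = random.Random(0)
shared_dim = 4

# Hypothetical extractor outputs; age-gender features are concatenated (an assumption).
face_input = [0.3, -1.2, 0.8] + [0.1, 0.9]
voice_input = [1.1, 0.4, -0.5, 0.2] + [0.2, 0.8]

# One projection per modality, both mapping into the same shared space.
face_weights = [[rng.gauss(0, 1) for _ in face_input] for _ in range(shared_dim)]
voice_weights = [[rng.gauss(0, 1) for _ in voice_input] for _ in range(shared_dim)]
face_emb = project(face_input, face_weights)
voice_emb = project(voice_input, voice_weights)

# Both modalities are scored against the same identity prototypes (identity 0 here).
prototypes = [[rng.gauss(0, 1) for _ in range(shared_dim)] for _ in range(3)]
loss = angular_margin_loss(face_emb, prototypes, 0) + angular_margin_loss(voice_emb, prototypes, 0)
print(loss)
```

Because both projections are scored against the same prototypes, lowering the loss pulls a person's face and voice embeddings toward the same direction in the shared space, which is what allows a face and a voice to be compared by cosine similarity at test time.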
