🤖 AI Summary
This work addresses the underexplored problem of cross-modal face–voice association in multilingual settings. We propose the first framework tailored to bilingual and multilingual real-world communication scenarios, built on our newly constructed multilingual audio-visual dataset, MAV-Celeb. The approach introduces an audio–visual joint representation learning architecture that integrates deep cross-modal matching with language-aware feature alignment. Key contributions include: (1) the first multilingual benchmark for face–voice association, the MultiLingual-FaceVoice Benchmark; (2) the release of MAV-Celeb, a high-quality multilingual dataset annotated with speaker identity and language labels; and (3) reproducible strong baseline models that significantly improve the robustness of cross-modal matching under cross-lingual conditions. This work establishes a new paradigm and provides empirical foundations for generalizable cross-modal biometric recognition.
📝 Abstract
Advances in technology have led to the use of multimodal systems in a wide range of real-world applications, and audio-visual systems are among the most widely used of these. In recent years, associating a person's face with their voice has gained attention because of the unique correlation between the two. The Face-voice Association in Multilingual Environments (FAME) 2026 Challenge explores face-voice association under the specific condition of a multilingual scenario. This condition is inspired by the fact that half of the world's population is bilingual and people frequently communicate in multilingual settings. The challenge uses the Multilingual Audio-Visual (MAV-Celeb) dataset to explore face-voice association in multilingual environments. This report describes the challenge, the dataset, the baseline models, and the tasks.