Problem
Research questions and friction points this paper is trying to address.
Develops a shared embedding space for face-voice association
Addresses multilingual testing on languages unseen during training
Uses adaptive angular margin (AAM) loss to improve feature discrimination
Innovation
Methods, ideas, or system contributions that make the work stand out.
Separate uni-modal pipelines for face and voice processing
Projection into a shared embedding space trained with AAM loss
Additional age-gender feature extraction to support prediction
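
The pipeline listed above can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function names (`project`, `angular_margin_loss`), the feature dimensions, the random projection weights, and the choice to concatenate age-gender features onto each uni-modal feature vector are all assumptions made for the example. The margin here is a fixed additive one in the ArcFace style; the summary does not say how the paper adapts the margin, so that part is left out.

```python
import math
import random


def l2_normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def project(features, weights):
    """Linear projection (one weight row per output dimension), then L2-normalize."""
    return l2_normalize(
        [sum(w * x for w, x in zip(row, features)) for row in weights]
    )


def angular_margin_loss(embedding, class_centers, label, margin=0.2, scale=30.0):
    """Cross-entropy over scaled cosines, with a margin added to the target angle.

    Assumption: fixed additive margin; the adaptive rule is not given in the summary.
    """
    emb = l2_normalize(embedding)
    logits = []
    for idx, center in enumerate(class_centers):
        cos = sum(a * b for a, b in zip(emb, l2_normalize(center)))
        cos = max(-1.0, min(1.0, cos))  # guard acos against rounding error
        if idx == label:
            # Penalize the true class by widening its angle (capped at pi).
            cos = math.cos(min(math.pi, math.acos(cos) + margin))
        logits.append(scale * cos)
    peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[label]


# Hypothetical sizes and random stand-ins for real extractor outputs.
rng = random.Random(0)
FACE_DIM, VOICE_DIM, AUX_DIM, SHARED_DIM, NUM_IDENTITIES = 8, 6, 3, 4, 5


def random_vector(n):
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def random_matrix(rows, cols):
    return [random_vector(cols) for _ in range(rows)]


face_feat, voice_feat = random_vector(FACE_DIM), random_vector(VOICE_DIM)
face_aux, voice_aux = random_vector(AUX_DIM), random_vector(AUX_DIM)  # age-gender

# Separate uni-modal pipelines, each with its own projection into the shared space.
face_proj = random_matrix(SHARED_DIM, FACE_DIM + AUX_DIM)
voice_proj = random_matrix(SHARED_DIM, VOICE_DIM + AUX_DIM)
face_emb = project(face_feat + face_aux, face_proj)      # list concatenation
voice_emb = project(voice_feat + voice_aux, voice_proj)

# Face-voice association score: cosine similarity of the unit-length embeddings.
similarity = sum(a * b for a, b in zip(face_emb, voice_emb))

# Training signal for one sample against hypothetical identity centers.
centers = random_matrix(NUM_IDENTITIES, SHARED_DIM)
loss_with_margin = angular_margin_loss(face_emb, centers, label=2)
loss_no_margin = angular_margin_loss(face_emb, centers, label=2, margin=0.0)
```

Because both modalities land in the same unit-norm space, a face and a voice can be compared directly with a cosine score, and the margin makes the loss strictly harder on the true identity than plain softmax would.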