🤖 AI Summary
This study addresses the challenge of low emotion recognition accuracy in noisy crowd acoustic environments, particularly for short audio segments (as brief as 250 ms). We conduct the first systematic evaluation of multilingual speech foundation models (SFMs) for crowd emotion recognition (CER), benchmarking them against monolingual and speaker-recognition SFMs under controlled conditions. Experiments span three segment durations (1 s, 500 ms, and 250 ms) on a unified evaluation benchmark. Results demonstrate that multilingual SFMs consistently outperform all baselines across all durations, owing to their robust representations of multilingual content, accent variability, and acoustic noise; gains are especially pronounced for 250 ms segments, where multilingual SFMs markedly improve both robustness and classification accuracy. This work establishes multilingual SFMs as a new paradigm for CER and introduces the first reproducible strong baseline and standardized evaluation benchmark for short-duration crowd emotion analysis.
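As a concrete illustration of the evaluation setup, here is a minimal sketch of slicing crowd audio into the three segment durations studied (1 s, 500 ms, and 250 ms) and extracting mean-pooled embeddings from a multilingual SFM. The checkpoint `facebook/wav2vec2-xls-r-300m` and the input file `crowd_clip.wav` are illustrative assumptions; this excerpt does not name the exact models or data used in the paper.

```python
# Minimal sketch (not the paper's released code): segment crowd audio into
# fixed durations and extract mean-pooled embeddings from a multilingual SFM.
import torch
import torchaudio
from transformers import AutoFeatureExtractor, AutoModel

SAMPLE_RATE = 16_000
CHECKPOINT = "facebook/wav2vec2-xls-r-300m"  # assumed multilingual SFM

extractor = AutoFeatureExtractor.from_pretrained(CHECKPOINT)
sfm = AutoModel.from_pretrained(CHECKPOINT).eval()

def segment(waveform: torch.Tensor, duration_s: float) -> list[torch.Tensor]:
    """Split a mono waveform into non-overlapping fixed-duration segments."""
    seg_len = int(duration_s * SAMPLE_RATE)
    n = waveform.shape[-1] // seg_len
    return [waveform[..., i * seg_len:(i + 1) * seg_len] for i in range(n)]

@torch.no_grad()
def embed(segment_wave: torch.Tensor) -> torch.Tensor:
    """Mean-pool the SFM's last hidden states into one vector per segment."""
    inputs = extractor(segment_wave.squeeze().numpy(),
                       sampling_rate=SAMPLE_RATE, return_tensors="pt")
    hidden = sfm(**inputs).last_hidden_state       # (1, frames, dim)
    return hidden.mean(dim=1).squeeze(0)           # (dim,)

wave, sr = torchaudio.load("crowd_clip.wav")       # hypothetical input file
wave = torchaudio.functional.resample(wave.mean(0, keepdim=True), sr, SAMPLE_RATE)
for dur in (1.0, 0.5, 0.25):                       # the three studied durations
    vectors = [embed(s) for s in segment(wave, dur)]
```

Note that at 16 kHz a 250 ms segment is only 4,000 samples; a wav2vec2-style convolutional encoder downsamples by roughly 320x, which still yields about a dozen transformer frames, so mean pooling remains well defined even at the shortest duration.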
📄 Abstract
This paper investigates polyglot (multilingual) speech foundation models (SFMs) for Crowd Emotion Recognition (CER). We hypothesize that polyglot SFMs, pre-trained on diverse languages, accents, and speech patterns, are particularly adept at navigating the noisy, complex acoustic environments characteristic of crowd settings, and therefore offer a significant advantage for CER. To substantiate this, we perform a comprehensive analysis comparing polyglot, monolingual, and speaker-recognition SFMs through extensive experiments on a benchmark CER dataset across varying audio durations (1 s, 500 ms, and 250 ms). The results consistently demonstrate the superiority of polyglot SFMs, which outperform their counterparts across all audio lengths and excel even with extremely short-duration inputs. These findings pave the way for the adoption of SFMs in establishing new benchmarks for CER.
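To make the comparison protocol concrete, below is a hedged sketch of a per-family, per-duration evaluation loop using a logistic-regression probe on precomputed segment embeddings. The probe choice, the family labels, and the `embeddings` dictionary layout are assumptions for illustration; the paper's actual downstream classifier is not specified in this excerpt.

```python
# Hedged sketch of the comparison protocol, not the paper's released code.
# embeddings[(family, dur)] is assumed to hold precomputed segment embeddings
# and crowd-emotion labels, e.g. produced by the extraction sketch above.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def compare_families(embeddings, durations=(1.0, 0.5, 0.25)):
    """Train one probe per (SFM family, segment duration) and report accuracy.

    embeddings[(family, dur)] -> (X_train, y_train, X_test, y_test)
    """
    for family in ("polyglot", "monolingual", "speaker-recognition"):
        for dur in durations:
            X_tr, y_tr, X_te, y_te = embeddings[(family, dur)]
            # A lightweight linear probe keeps the comparison focused on the
            # representations rather than classifier capacity (an assumption).
            probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
            acc = accuracy_score(y_te, probe.predict(X_te))
            print(f"{family:18s} @ {int(dur * 1000):4d} ms  acc = {acc:.3f}")
```

Holding the probe and data splits fixed across families is one way to realize the controlled comparison the paper describes, since any accuracy difference then reflects the quality of each SFM's representations.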