🤖 AI Summary
This study addresses the challenge of low emotion recognition accuracy in noisy crowd acoustic environments, particularly for short audio segments (as brief as 250 ms). We conduct the first systematic evaluation of multilingual speech foundation models (SFMs) for crowd emotion recognition (CER), benchmarking them against monolingual and speaker-recognition SFMs under controlled conditions. Experiments span three segment durations (1 s, 500 ms, and 250 ms) on a unified evaluation benchmark. Results demonstrate that multilingual SFMs consistently outperform all baselines across all durations, owing to their robust representations of multilingual content, accent variability, and acoustic noise; gains are especially pronounced for 250 ms segments, where multilingual SFMs markedly improve both robustness and classification accuracy. This work establishes multilingual SFMs as a new paradigm for CER and introduces the first reproducible strong baseline and standardized evaluation benchmark for short-duration crowd emotion analysis.
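As a concrete illustration of the evaluation setup, here is a minimal sketch of slicing crowd audio into the three segment durations studied (1 s, 500 ms, and 250 ms) and extracting mean-pooled embeddings from a multilingual SFM. The checkpoint `facebook/wav2vec2-xls-r-300m` and the input file `crowd_clip.wav` are illustrative assumptions; this excerpt does not name the exact models or data used in the paper.

```python
# Minimal sketch (not the paper's released code): segment crowd audio into
# fixed durations and extract mean-pooled embeddings from a multilingual SFM.
import torch
import torchaudio
from transformers import AutoFeatureExtractor, AutoModel

SAMPLE_RATE = 16_000
CHECKPOINT = "facebook/wav2vec2-xls-r-300m"  # assumed multilingual SFM

extractor = AutoFeatureExtractor.from_pretrained(CHECKPOINT)
sfm = AutoModel.from_pretrained(CHECKPOINT).eval()

def segment(waveform: torch.Tensor, duration_s: float) -> list[torch.Tensor]:
    """Split a mono waveform into non-overlapping fixed-duration segments."""
    seg_len = int(duration_s * SAMPLE_RATE)
    n = waveform.shape[-1] // seg_len
    return [waveform[..., i * seg_len:(i + 1) * seg_len] for i in range(n)]

@torch.no_grad()
def embed(segment_wave: torch.Tensor) -> torch.Tensor:
    """Mean-pool the SFM's last hidden states into one vector per segment."""
    inputs = extractor(segment_wave.squeeze().numpy(),
                       sampling_rate=SAMPLE_RATE, return_tensors="pt")
    hidden = sfm(**inputs).last_hidden_state       # (1, frames, dim)
    return hidden.mean(dim=1).squeeze(0)           # (dim,)

wave, sr = torchaudio.load("crowd_clip.wav")       # hypothetical input file
wave = torchaudio.functional.resample(wave.mean(0, keepdim=True), sr, SAMPLE_RATE)
for dur in (1.0, 0.5, 0.25):                       # the three studied durations
    vectors = [embed(s) for s in segment(wave, dur)]
```

Note that at 16 kHz a 250 ms segment is only 4,000 samples; a wav2vec2-style convolutional encoder downsamples by roughly 320x, which still yields about a dozen transformer frames, so mean pooling remains well defined even at the shortest duration.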
📄 Abstract
This paper investigates polyglot (multilingual) speech foundation models (SFMs) for Crowd Emotion Recognition (CER). We hypothesize that polyglot SFMs, pre-trained on diverse languages, accents, and speech patterns, are particularly adept at navigating the noisy, complex acoustic environments characteristic of crowd settings, and therefore offer a significant advantage for CER. To substantiate this, we perform a comprehensive analysis comparing polyglot, monolingual, and speaker-recognition SFMs through extensive experiments on a benchmark CER dataset across varying audio durations (1 s, 500 ms, and 250 ms). The results consistently demonstrate the superiority of polyglot SFMs, which outperform their counterparts across all audio lengths and excel even with extremely short-duration inputs. These findings pave the way for the adoption of SFMs in establishing new benchmarks for CER.
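To make the comparison protocol concrete, below is a hedged sketch of a per-family, per-duration evaluation loop using a logistic-regression probe on precomputed segment embeddings. The probe choice, the family labels, and the `embeddings` dictionary layout are assumptions for illustration; the paper's actual downstream classifier is not specified in this excerpt.

```python
# Hedged sketch of the comparison protocol, not the paper's released code.
# embeddings[(family, dur)] is assumed to hold precomputed segment embeddings
# and crowd-emotion labels, e.g. produced by the extraction sketch above.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def compare_families(embeddings, durations=(1.0, 0.5, 0.25)):
    """Train one probe per (SFM family, segment duration) and report accuracy.

    embeddings[(family, dur)] -> (X_train, y_train, X_test, y_test)
    """
    for family in ("polyglot", "monolingual", "speaker-recognition"):
        for dur in durations:
            X_tr, y_tr, X_te, y_te = embeddings[(family, dur)]
            # A lightweight linear probe keeps the comparison focused on the
            # representations rather than classifier capacity (an assumption).
            probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
            acc = accuracy_score(y_te, probe.predict(X_te))
            print(f"{family:18s} @ {int(dur * 1000):4d} ms  acc = {acc:.3f}")
```

Holding the probe and data splits fixed across families is one way to realize the controlled comparison the paper describes, since any accuracy difference then reflects the quality of each SFM's representations.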