🤖 AI Summary
This study addresses two challenges in speaker confidence detection, scarce labeled data and highly subjective paralinguistic annotations, by proposing a semi-supervised hybrid framework. The approach fuses deep semantic embeddings from the Whisper encoder with an interpretable acoustic feature vector (eGeMAPS descriptors plus auxiliary probability estimates of vocal stress and disfluency) and introduces an uncertainty-aware pseudo-labeling strategy that retains only high-quality pseudo-labeled samples for training. By jointly using semantic content and explicit prosodic signals, the method compensates for confidence-related cues that purely semantic representations lose, and it favors data quality over quantity. Experimental results show that the model achieves a Macro-F1 score of 0.751, outperforming self-supervised baselines such as WavLM, HuBERT, and Wav2Vec 2.0, and yielding a 3% improvement over unimodal Whisper on the minority class.
📝 Abstract
Automatic detection of speaker confidence is critical for adaptive computing but remains constrained by limited labelled data and the subjectivity of paralinguistic annotations. This paper proposes a semi-supervised hybrid framework that fuses deep semantic embeddings from the Whisper encoder with an interpretable acoustic feature vector composed of eGeMAPS descriptors and auxiliary probability estimates of vocal stress and disfluency. To mitigate reliance on scarce ground-truth data, we introduce an Uncertainty-Aware Pseudo-Labelling strategy in which a model generates labels for unlabelled data and only high-quality samples are retained for training. Experimental results demonstrate that the proposed approach achieves a Macro-F1 score of 0.751, outperforming self-supervised baselines, including WavLM, HuBERT, and Wav2Vec 2.0. The hybrid architecture also surpasses the unimodal Whisper baseline, yielding a 3% improvement in the minority class, which indicates that explicit prosodic and auxiliary features provide corrective signals that are otherwise lost in deep semantic representations. Ablation studies further show that a curated set of high-confidence pseudo-labels outperforms indiscriminate large-scale augmentation, indicating that data quality outweighs quantity for perceived confidence detection.
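
The abstract does not specify how "high-quality" pseudo-labels are selected, so the following is only an illustrative sketch of one common way to filter pseudo-labels by uncertainty, not the authors' method. It assumes a model that outputs class probabilities for each unlabelled sample, and it keeps a sample only if its top-class probability is high and its normalised entropy is low. The function names and both thresholds (`min_confidence`, `max_entropy`) are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def normalized_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy of a class distribution, scaled to the range [0, 1]."""
    num_classes = len(probs)
    if num_classes < 2:
        return 0.0
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return entropy / math.log(num_classes)


def select_pseudo_labels(
    probabilities: Sequence[Sequence[float]],
    min_confidence: float = 0.9,  # hypothetical threshold, not from the paper
    max_entropy: float = 0.5,     # hypothetical threshold, not from the paper
) -> List[Tuple[int, int]]:
    """Return (sample_index, pseudo_label) pairs for the retained samples."""
    selected = []
    for idx, probs in enumerate(probabilities):
        # The pseudo-label is the class with the highest predicted probability.
        label = max(range(len(probs)), key=lambda c: probs[c])
        confident = probs[label] >= min_confidence
        certain = normalized_entropy(probs) <= max_entropy
        if confident and certain:
            selected.append((idx, label))
    return selected


# Made-up predictions for three unlabelled utterances (two classes).
unlabelled_probs = [
    [0.97, 0.03],  # confident prediction for class 0
    [0.55, 0.45],  # ambiguous prediction
    [0.08, 0.92],  # confident prediction for class 1
]
retained = select_pseudo_labels(unlabelled_probs)
print(retained)
```

In this toy input the ambiguous second sample is discarded and the other two are kept with their predicted labels. Tightening either threshold trades pseudo-label quantity for quality, which is the trade-off the ablation result in the abstract refers to.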