🤖 AI Summary
This paper addresses the opacity of the relationships between acoustic features and the two affective dimensions, arousal and valence, in soundscape emotion recognition (SER). It proposes a graph topology inference framework based on linear structural equation models (SEM), combining an information criterion with a generalized elbow detector to learn a sparse graph of feature-emotion relations and to quantify the uncertainty in the chosen sparsity level. Experiments on the Emo-Soundscapes dataset support feature selection and give an interpretable visualization of the inferred relations. Notably, the learned graph shows a strong connection between arousal and valence, which challenges the common assumption in SER that the two dimensions can be treated independently.
📝 Abstract
Research on soundscapes has shifted the focus of environmental acoustics from noise levels to the perception of sounds, incorporating contextual factors. Soundscape emotion recognition (SER) models perception using a set of features, with arousal and valence commonly regarded as sufficient descriptors of affect. In this work, we blend graph learning techniques with a novel information criterion to develop a feature selection framework for SER. Specifically, we estimate a sparse graph representation of feature relations using linear structural equation models (SEM) tailored to the widely used Emo-Soundscapes dataset. The resulting graph captures the relations between input features and the two emotional outputs. To determine the appropriate level of sparsity, we propose a novel generalized elbow detector, which provides both a point estimate and an uncertainty interval. We conduct an extensive evaluation of our methods, including visualizations of the inferred relations. While several of our findings align with previous studies, the graph representation also reveals a strong connection between arousal and valence, challenging common SER assumptions.
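To make the idea of a sparse linear SEM concrete, the following is a minimal sketch and not the authors' implementation. It assumes the model X ≈ XA with a zero diagonal on A, fits A with an L1 penalty using proximal gradient descent (ISTA), and picks a sparsity level with a simple chord-distance elbow heuristic. The abstract does not specify the paper's information criterion or its generalized elbow detector, so both are replaced here by stand-ins, and no uncertainty interval is computed. All function names (`fit_sparse_sem`, `sparsity_path`, `elbow_index`) are hypothetical.

```python
import numpy as np


def soft_threshold(z, t):
    """Elementwise soft-thresholding, the proximal operator of t * |z|."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)


def fit_sparse_sem(X, lam, n_iter=500):
    """Estimate A in X ~ X @ A (zero diagonal) with an L1 penalty, via ISTA.

    Minimizes (1 / 2n) * ||X - X A||_F^2 + lam * ||A||_1 subject to diag(A) = 0.
    This is an illustrative stand-in, not the estimator used in the paper.
    """
    n, d = X.shape
    G = X.T @ X / n                          # sample second-moment matrix
    step = 1.0 / np.linalg.eigvalsh(G)[-1]   # 1 / Lipschitz constant of the gradient
    A = np.zeros((d, d))
    for _ in range(n_iter):
        grad = G @ A - G                     # gradient of the least-squares term
        A = soft_threshold(A - step * grad, step * lam)
        np.fill_diagonal(A, 0.0)             # no self-loops
    return A


def sparsity_path(X, lams, tol=1e-8):
    """Number of nonzero edges in the fitted graph for each penalty value."""
    return [int(np.sum(np.abs(fit_sparse_sem(X, lam)) > tol)) for lam in lams]


def elbow_index(y):
    """Index of the point with the largest vertical gap to the end-to-end chord.

    A simple heuristic on a normalized curve; it is NOT the generalized
    elbow detector proposed in the paper.
    """
    y = np.asarray(y, dtype=float)
    x = np.linspace(0.0, 1.0, len(y))
    yn = (y - y.min()) / (y.max() - y.min() + 1e-12)
    chord = yn[0] + (yn[-1] - yn[0]) * x
    return int(np.argmax(np.abs(yn - chord)))
```

In a setting like the one described, the columns of `X` would hold the standardized acoustic features together with the arousal and valence ratings, so that edges into the last two columns indicate which features are retained for each emotional output. One could then sweep `lams = np.logspace(-2, 0, 15)`, compute `sparsity_path(X, lams)`, and read off a candidate penalty with `elbow_index`.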