LHGNN: Local-Higher Order Graph Neural Networks For Audio Classification and Tagging

📅 2025-01-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing audio Transformer models capture only pairwise acoustic relationships, limiting their ability to identify distinct sound objects, especially under low-shot conditions. To address this, we propose LHGNN, a graph neural network for audio classification and tagging that jointly encodes local neighbourhood graphs and higher-order hypergraphs derived from Fuzzy C-Means clustering. Our approach is the first to integrate local binary relations with fuzzy-clustering-induced higher-order cliques, eliminating the need for ImageNet pretraining. Evaluated on three public audio benchmarks, LHGNN consistently outperforms Transformer baselines while reducing parameter count by 30–50%. Crucially, it achieves greater gains in low-resource settings, yielding an average +2.8% improvement in mean Average Precision (mAP). These results demonstrate the effectiveness and generalization advantage of explicitly modeling higher-order acoustic object structures.


📝 Abstract
Transformers have set new benchmarks in audio processing tasks, leveraging self-attention mechanisms to capture complex patterns and dependencies within audio data. However, their focus on pairwise interactions limits their ability to process the higher-order relations essential for identifying distinct audio objects. To address this limitation, this work introduces the Local-Higher Order Graph Neural Network (LHGNN), a graph-based model that enhances feature understanding by integrating local neighbourhood information with higher-order data from Fuzzy C-Means clusters, thereby capturing a broader spectrum of audio relationships. Evaluation of the model on three publicly available audio datasets shows that it outperforms Transformer-based models across all benchmarks while operating with substantially fewer parameters. Moreover, LHGNN demonstrates a distinct advantage in scenarios lacking ImageNet pretraining, establishing its effectiveness and efficiency in environments where extensive pretraining data is unavailable.
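The core idea described above — combining a local neighbourhood graph with higher-order groupings from Fuzzy C-Means clustering — can be illustrated with a minimal NumPy sketch. This is a hypothetical simplification, not the paper's actual layer: `knn_adjacency`, `fuzzy_cmeans_memberships`, and `lhgnn_like_layer` are illustrative names, and the real LHGNN uses learned transformations rather than plain feature averaging.

```python
import numpy as np

def knn_adjacency(X, k):
    """Binary k-nearest-neighbour adjacency over feature vectors (local relations)."""
    d = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)          # exclude self-loops
    idx = np.argsort(d, axis=1)[:, :k]   # indices of the k closest nodes
    A = np.zeros((len(X), len(X)))
    A[np.repeat(np.arange(len(X)), k), idx.ravel()] = 1.0
    return A

def fuzzy_cmeans_memberships(X, c, m=2.0, iters=50, seed=0):
    """Standard Fuzzy C-Means: soft membership of each node in c clusters."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(iters):
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None] - centers[None], axis=-1) + 1e-9
        inv = d ** (-2.0 / (m - 1))
        U = inv / inv.sum(axis=1, keepdims=True)
    return U

def lhgnn_like_layer(X, k=4, c=3):
    """Concatenate raw, local (k-NN averaged) and higher-order (cluster) features."""
    A = knn_adjacency(X, k)
    local = (A @ X) / np.maximum(A.sum(axis=1, keepdims=True), 1)
    U = fuzzy_cmeans_memberships(X, c)
    cluster_feats = (U.T @ X) / U.sum(axis=0)[:, None]
    higher = U @ cluster_feats           # membership-weighted mix of cluster features
    return np.concatenate([X, local, higher], axis=1)
```

The soft memberships act like hyperedges: every node aggregates from every cluster in proportion to its membership, which is how groups larger than a node pair (higher-order relations) enter the representation.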
Problem

Research questions and friction points this paper is trying to address.

Transformer-based models
Audio processing
Limited data performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

LHGNN
Complex Sound Object Relationship
Efficient Parameter Utilization
Shubhr Singh
School of Electronic Engineering and Computer Science, Queen Mary University of London, UK
Emmanouil Benetos
Queen Mary University of London
Machine listening · Audio signal processing · Music information retrieval · Machine learning
Huy Phan
Meta, 75002 Paris, France
Dan Stowell
Tilburg University / Naturalis Biodiversity Centre