AI Summary
This work addresses the attention bottleneck in long-form audio processing, where dominant background patterns can obscure rare salient events. The authors propose a training-free neuro-auditory cognitive architecture that reframes attention allocation as an auditory salience filtering task. Inspired by neuroscience, the framework uses an oscillatory working memory (OWM) mechanism to maintain attractor-like states and to activate a high-level audio language model (ALM) only when perceptually salient content is detected. Because the gating relies on adaptive energy fluctuations and requires no training, the architecture can be placed in front of existing ALMs. Experiments show that the approach raises AudioQwen's average precision (AP) on the XD-Violence dataset from 53.50% to 70.60%. Qualitative case studies on the USoW dataset show that it captures novel events and subcategory shifts while remaining robust to ambient urban noise.
Abstract
Audio provides critical situational cues, yet current Audio Language Models (ALMs) face an attention bottleneck in long-form recordings, where dominant background patterns can dilute rare, salient events. We introduce NAACA, a training-free NeuroAuditory Attentive Cognitive Architecture that reframes attention allocation as an auditory salience filtering problem. At its core is OWM, a neuro-inspired Oscillatory Working Memory that maintains stable attractor-like states and triggers higher-level ALM reasoning only when adaptive energy fluctuations signal perceptual salience. On XD-Violence, NAACA improves AudioQwen's average precision (AP) from 53.50% to 70.60% while reducing unnecessary ALM invocations. Furthermore, qualitative case studies on the Urban Soundscapes of the World (USoW) dataset show that OWM captures novel events and subcategory shifts while remaining robust to transient pauses and ambient urban noise.
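To make the gating idea concrete, the following is a minimal sketch of salience-triggered invocation. It is not the authors' OWM: it substitutes a simple exponentially weighted running mean and variance for the oscillatory memory, and the names `salience_gate`, `run_gated`, `alpha`, `k`, `warmup`, `min_std`, and `expensive_model` are all hypothetical. It only illustrates how an adaptive energy threshold can decide when a costly model is called.

```python
import math


def salience_gate(energies, alpha=0.1, k=3.0, warmup=5, min_std=1e-3):
    """Return indices of frames whose energy departs from a running baseline.

    Assumption: a frame is "salient" when its energy deviates from an
    exponentially weighted mean by more than k running standard deviations.
    This is a stand-in for OWM, not a reimplementation of it.
    """
    triggered = []
    mean = 0.0
    var = 0.0
    for i, energy in enumerate(energies):
        if i == 0:
            mean = energy  # initialise the baseline from the first frame
            continue
        deviation = energy - mean
        threshold = k * max(math.sqrt(var), min_std)
        if i >= warmup and abs(deviation) > threshold:
            triggered.append(i)
        # Update the baseline after the decision so it adapts to lasting shifts.
        mean += alpha * deviation
        var = (1.0 - alpha) * (var + alpha * deviation ** 2)
    return triggered


def run_gated(frames, energies, expensive_model):
    """Call the (hypothetical) expensive model only on salient frames."""
    return {i: expensive_model(frames[i]) for i in salience_gate(energies)}
```

In this sketch a long stretch of constant background energy never triggers the model, while a single loud frame does. A real system would compute energies from audio features and pass the surrounding audio segment, not a single frame, to the ALM.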