🤖 AI Summary
To address weak temporal modeling and imprecise boundary localization of transient events in self-supervised sound event detection (SED), this paper proposes JiTTER (Jigsaw Temporal Transformer for Event Reconstruction), a pre-training framework built on hierarchical temporal shuffle reconstruction. JiTTER applies random temporal permutations at both the block level and the frame level, combined with noise injection, compelling a Transformer-based model to explicitly learn temporal order and event structure and thereby overcoming the ambiguous boundary modeling inherent in conventional masked-prediction objectives. Rather than masked block prediction, JiTTER formulates structured temporal shuffling and reconstruction as the self-supervised pre-training objective. Evaluated on the DESED dataset, JiTTER outperforms MAT-SED with a 5.89% improvement in Polyphonic Sound Detection Score (PSDS), with notable gains in fine-grained event onset/offset detection. These results indicate that explicit temporal order modeling is an effective pre-training signal for SED.
📝 Abstract
Sound event detection (SED) has significantly benefited from self-supervised learning (SSL) approaches, particularly the masked audio transformer for SED (MAT-SED), which leverages masked block prediction to reconstruct missing audio segments. However, while effective at capturing global dependencies, masked block prediction disrupts transient sound events and lacks explicit enforcement of temporal order, making it less suitable for fine-grained event boundary detection. To address these limitations, we propose JiTTER (Jigsaw Temporal Transformer for Event Reconstruction), an SSL framework designed to enhance temporal modeling in transformer-based SED. JiTTER introduces a hierarchical temporal shuffle reconstruction strategy, in which audio sequences are randomly shuffled at both the block level and the frame level, forcing the model to reconstruct the correct temporal order. This pretraining objective encourages the model to learn both global event structures and fine-grained transient details, improving its ability to detect events with sharp onset-offset characteristics. Additionally, we incorporate noise injection during block shuffling, a subtle perturbation mechanism that further regularizes feature learning and enhances model robustness. Experimental results on the DESED dataset demonstrate that JiTTER outperforms MAT-SED, achieving a 5.89% improvement in PSDS and highlighting the effectiveness of explicit temporal reasoning in SSL-based SED. Our findings suggest that structured temporal reconstruction tasks, rather than simple masked prediction, offer a more effective pretraining paradigm for sound event representation learning.
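To make the hierarchical shuffle concrete, the sketch below illustrates the general idea described in the abstract: permute contiguous blocks of a frame-level feature sequence, inject light Gaussian noise during the block shuffle, and additionally permute frames inside randomly chosen blocks, returning the permutations a model would be trained to invert. This is a minimal illustration under assumed parameters (`block_size`, `frame_shuffle_prob`, `noise_std` are hypothetical names), not the paper's actual implementation.

```python
import numpy as np

def hierarchical_shuffle(features, block_size=8, frame_shuffle_prob=0.5,
                         noise_std=0.01, rng=None):
    """Illustrative JiTTER-style hierarchical temporal shuffle (not the
    authors' code). `features` is a (T, D) array of frame features; T is
    assumed to be a multiple of block_size for simplicity."""
    rng = rng if rng is not None else np.random.default_rng()
    T, D = features.shape
    assert T % block_size == 0, "pad T to a multiple of block_size"
    n_blocks = T // block_size

    # Block-level shuffle: permute contiguous chunks, disrupting global order.
    blocks = features.reshape(n_blocks, block_size, D)
    block_perm = rng.permutation(n_blocks)
    blocks = blocks[block_perm]

    # Noise injection during block shuffle: a subtle perturbation
    # that regularizes feature learning.
    blocks = blocks + rng.normal(0.0, noise_std, blocks.shape)

    # Frame-level shuffle: permute frames inside randomly selected blocks,
    # disrupting fine-grained transient structure.
    frame_perms = np.tile(np.arange(block_size), (n_blocks, 1))
    for b in range(n_blocks):
        if rng.random() < frame_shuffle_prob:
            frame_perms[b] = rng.permutation(block_size)
            blocks[b] = blocks[b][frame_perms[b]]

    shuffled = blocks.reshape(T, D)
    # The pretraining target is to recover the original order, i.e. to
    # invert block_perm and frame_perms from the shuffled sequence.
    return shuffled, block_perm, frame_perms
```

A reconstruction loss (e.g. regression against the unshuffled features, or classification over permutation indices) would then be applied on top of a Transformer encoder fed with `shuffled`.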