🤖 AI Summary
Problem: Existing audio frontends struggle with harmonic feature extraction due to mismatched inductive biases and poor interpretability. Method: This paper introduces the *combolutional layer*, a time-domain harmonic analysis module that combines a learnable-delay IIR comb filter with envelope detection. It is implemented entirely in real-valued arithmetic, has a low parameter count (<1K), is interpretable, and supports CPU-efficient inference. It serves as a drop-in replacement for standard convolutional layers and is trained end-to-end. Contribution/Results: Evaluated on piano transcription, speaker classification, and key detection, the combolutional layer outperforms mainstream spectrogram-based frontends (e.g., Log-Mel, CQT) in harmonic structure modeling while significantly reducing computational overhead. Its core innovation is embedding comb filtering, which is physically grounded in harmonic resonance, into a differentiable neural architecture, combining efficiency, interpretability, and task-specific adaptability.
📝 Abstract
Selecting appropriate inductive biases is an essential step in the design of machine learning models, especially when working with audio, where even short clips may contain millions of samples. To this end, we propose the combolutional layer: a learned-delay IIR comb filter and fused envelope detector, which extracts harmonic features in the time domain. We demonstrate the efficacy of the combolutional layer on three information retrieval tasks, evaluate its computational cost relative to other audio frontends, and provide efficient implementations for training. We find that the combolutional layer is an effective replacement for convolutional layers in audio tasks where precise harmonic analysis is important, e.g., piano transcription, speaker classification, and key detection. Additionally, the combolutional layer has several other key benefits over existing frontends, namely: low parameter count, efficient CPU inference, strictly real-valued computations, and improved interpretability.
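To make the idea of a comb filter followed by an envelope detector concrete, the following is a minimal NumPy sketch, not the paper's implementation. It assumes an integer delay, a fixed feedback gain, and an RMS readout as the envelope; the paper's layer learns its delay and fuses the envelope detector, and neither detail is reproduced here. The function name `comb_envelope` and all parameter values are hypothetical.

```python
import numpy as np

def comb_envelope(x, delay, gain=0.9):
    """Feedback (IIR) comb filter y[n] = x[n] + gain * y[n - delay],
    followed by an RMS readout used here as a simple envelope (assumption)."""
    y = np.zeros(len(x), dtype=np.float64)
    for n in range(len(x)):
        feedback = y[n - delay] if n >= delay else 0.0
        y[n] = x[n] + gain * feedback
    return float(np.sqrt(np.mean(y ** 2)))

# A 200 Hz sine sampled at 8 kHz has a period of exactly 40 samples.
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
x = np.sin(2 * np.pi * 200 * t)

# A delay equal to the period reinforces the tone; half a period cancels it.
matched = comb_envelope(x, delay=40)
mismatched = comb_envelope(x, delay=20)
```

In this sketch the output is large when the delay matches the period of the input and small otherwise, which is the sense in which a bank of such filters with different delays can act as a harmonic feature extractor.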