Unified Learnable 2D Convolutional Feature Extraction for ASR

📅 2025-09-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current ASR front-ends predominantly rely on hand-crafted or hybrid modules, which carry strong inductive biases, fragment the architecture across several layer topologies, and use parameters inefficiently. To address these limitations, we propose a unified, end-to-end learnable 2D convolutional front-end that replaces conventional filter banks and multi-source topologies with a single trainable 2D CNN for acoustic feature extraction. The design uses neither large pre-trained models nor hand-engineered priors, which reduces inductive bias and improves deployment flexibility. Experiments show performance on par with existing supervised learnable front-ends, alongside a 30–50% reduction in parameter count, which makes the approach especially suitable for scenarios with limited computational resources. The core contribution is a fully convolutional, unified, end-to-end learnable ASR front-end architecture.

📝 Abstract

Neural front-ends represent a promising approach to feature extraction for automatic speech recognition (ASR) systems, as they make it possible to learn features tailored specifically to different tasks. Yet many of the existing techniques remain heavily influenced by classical methods. While this inductive bias may ease the system design, our work aims to develop a more generic front-end for feature extraction. Furthermore, we seek to unify the front-end architecture, in contrast to existing approaches that compose several layer topologies originating from different sources. The experiments systematically show how to reduce the influence of existing techniques to achieve a generic front-end. The resulting 2D convolutional front-end is parameter-efficient and, unlike large models pre-trained on unlabeled audio, suitable for scenarios with limited computational resources. The results demonstrate that this generic unified approach is not only feasible but also matches the performance of existing supervised learnable feature extractors.

Problem

Research questions and friction points this paper is trying to address.

Developing a generic neural front-end for ASR feature extraction

Unifying front-end architecture instead of combining different layer topologies

Creating parameter-efficient 2D convolutional feature extraction for limited resources

Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified 2D convolutional front-end architecture

Parameter-efficient design for limited resources

Learns task-specific features without classical methods
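
To make the parameter-efficiency point concrete, the following minimal sketch counts the trainable parameters and the output shape of a small stack of 2D convolutions. It is an illustration only: the `Conv2dSpec` class, the helper functions, the three-layer configuration, and the 80×100 single-channel input are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Conv2dSpec:
    """Shape of one 2D convolution layer (hypothetical values only)."""
    in_channels: int
    out_channels: int
    kernel: Tuple[int, int]           # (height, width)
    stride: Tuple[int, int] = (1, 1)  # (height, width)


def conv2d_params(layer: Conv2dSpec) -> int:
    """Kernel weights plus one bias per output channel."""
    kh, kw = layer.kernel
    return layer.in_channels * layer.out_channels * kh * kw + layer.out_channels


def conv2d_out_size(size: int, kernel: int, stride: int) -> int:
    """Output length along one axis, assuming no padding and no dilation."""
    return (size - kernel) // stride + 1


def stack_summary(layers: List[Conv2dSpec], height: int, width: int):
    """Return (total parameters, output shape as (channels, height, width))."""
    total = 0
    for layer in layers:
        total += conv2d_params(layer)
        height = conv2d_out_size(height, layer.kernel[0], layer.stride[0])
        width = conv2d_out_size(width, layer.kernel[1], layer.stride[1])
    return total, (layers[-1].out_channels, height, width)


# Hypothetical three-layer stack; NOT the configuration used in the paper.
LAYERS = [
    Conv2dSpec(1, 32, (3, 3), (2, 2)),
    Conv2dSpec(32, 32, (3, 3), (2, 2)),
    Conv2dSpec(32, 64, (3, 3), (1, 1)),
]

total_params, out_shape = stack_summary(LAYERS, height=80, width=100)
print(total_params, out_shape)  # 28064 (64, 17, 22)
```

Because each layer's weight count depends only on kernel size and channel widths, not on the input length, a purely convolutional front-end of this kind can stay in the tens of thousands of parameters.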
