Computer Audition: From Task-Specific Machine Learning to Foundation Models

📅 2024-07-22
🏛️ arXiv.org
📈 Citations: 4
Influential: 1
🤖 AI Summary
To address the longstanding reliance on single-task models and the lack of generalizable representations in computer audition, this paper surveys the emerging paradigm of Auditory Foundation Models (AFMs). It distills their core operating principles: unified multi-task modeling, cross-modal (audio–text) aligned representation learning, and instruction-driven human–machine interaction. Technically, the surveyed approaches span large-scale audio–text contrastive pretraining, multi-task prompt tuning, and self-supervised audio modeling. The overview shows how a single AFM can accommodate tasks the audio community previously tackled separately (including automatic speech recognition, sound source separation, and environmental sound classification) while enabling zero-shot transfer and open-domain speech understanding. This work charts the transition of computer audition toward generality, multi-task synergy, and natural human–machine interaction.
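The paper itself is an overview and ships no code; as a rough illustration of the audio–text contrastive pretraining mentioned above, the sketch below implements a CLIP/CLAP-style symmetric contrastive (InfoNCE) loss in NumPy. The function names and the temperature value are assumptions for illustration, not details taken from the paper.

```python
import numpy as np

def contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired audio/text embeddings."""
    # L2-normalise so the dot product is cosine similarity
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = a @ t.T / temperature          # (batch, batch); matched pairs on the diagonal
    n = len(logits)

    def xent_diag(l):
        # Cross-entropy where row i's correct class is column i
        l = l - l.max(axis=1, keepdims=True)            # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[np.arange(n), np.arange(n)].mean()

    # Average the audio-to-text and text-to-audio directions
    return 0.5 * (xent_diag(logits) + xent_diag(logits.T))
```

With perfectly aligned embeddings the loss approaches zero; with unrelated embeddings it sits near log(batch size), since the diagonal is no longer distinguished.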

📝 Abstract
Foundation models (FMs) are increasingly spearheading recent advances on a variety of tasks that fall under the purview of computer audition -- the use of machines to understand sounds. They feature several advantages over traditional pipelines: among others, the ability to consolidate multiple tasks in a single model, the option to leverage knowledge from other modalities, and the readily-available interaction with human users. Naturally, these promises have created substantial excitement in the audio community, and have led to a wave of early attempts to build new, general-purpose foundation models for audio. In the present contribution, we give an overview of computational audio analysis as it transitions from traditional pipelines towards auditory foundation models. Our work highlights the key operating principles that underpin those models, and showcases how they can accommodate multiple tasks that the audio community previously tackled separately.
Problem

Research questions and friction points this paper is trying to address.

Transition from task-specific models to auditory foundation models
Consolidate multiple audio tasks into single foundation models
Leverage cross-modal knowledge for general-purpose audio understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Foundation models unify multiple audio tasks
Leverage cross-modal knowledge integration
Transition from task-specific to general-purpose models
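The zero-shot transfer these bullets allude to reduces, in CLAP-style models, to nearest-neighbour matching in the shared audio–text embedding space: embed each candidate label's text, then pick the one closest to the audio embedding. The sketch below is purely illustrative; the toy character-histogram "encoder" is a hypothetical stand-in for a real trained text encoder, and the audio embedding is faked by encoding the matching caption.

```python
import numpy as np

def toy_text_encoder(text, dim=16):
    # Hypothetical stand-in for a trained text encoder: bag-of-characters histogram
    v = np.zeros(dim)
    for ch in text:
        v[ord(ch) % dim] += 1.0
    return v

def zero_shot_classify(audio_emb, label_texts, text_encoder):
    """Return the label whose text embedding is closest (cosine) to the audio embedding."""
    text_embs = np.stack([text_encoder(t) for t in label_texts])
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    a = audio_emb / np.linalg.norm(audio_emb)
    scores = text_embs @ a                  # cosine similarity per candidate label
    return label_texts[int(np.argmax(scores))]

# Pretend the audio encoder mapped a bark recording onto its caption's embedding
labels = ["dog barking", "rain", "siren"]
audio_emb = toy_text_encoder("dog barking")
prediction = zero_shot_classify(audio_emb, labels, toy_text_encoder)
```

Because the label set is supplied at inference time as text, no per-task classification head is trained, which is what lets a single model cover new sound categories.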
Andreas Triantafyllopoulos
Technical University of Munich
machine learning, affective computing, computer audition
Iosif Tsangko
PhD Student, Technische Universität München
Machine Learning, Deep Learning, Signal Processing, Natural Language Processing
Alexander Gebhard
CHI – Chair of Health Informatics, Technical University of Munich, MRI, Munich, Germany
A. Mesaros
Audio Research Group, Tampere University, Tampere, Finland
Tuomas Virtanen
Tampere University
machine listening, audio signal processing, audio
Björn W. Schuller
EIHW – Chair of Embedded Intelligence for Health Care and Wellbeing, University of Augsburg, Augsburg, Germany; GLAM – Group on Language, Audio, & Music, Imperial College, London, UK; MCML – Munich Center for Machine Learning, Munich, Germany; MDSI – Munich Data Science Institute, Munich, Germany