🤖 AI Summary
This work identifies a critical deficiency in audio foundation models (FMs): they struggle to perceive and predict turn-taking cues in natural spoken dialogue (e.g., when to start speaking, when to yield the floor, and when to backchannel), producing excessive overlapping speech or long stretches of silence. To address this, we introduce a dedicated benchmark for evaluating turn-taking behavior in speech FMs, proposing a supervised discriminative model trained on human-human conversations as an automated judge. We conduct the first systematic assessment of both open- and closed-source audio FMs, combining curated conversational data (e.g., Switchboard), an API-driven black-box evaluation framework, and a human user study. Experiments reveal pervasive issues across models, including failing to recognize when to speak up, interrupting too aggressively, and rarely backchanneling. We publicly release a fully reproducible evaluation platform and delineate concrete directions for improvement.
📝 Abstract
The recent wave of audio foundation models (FMs) could provide new capabilities for conversational modeling. However, there have been limited efforts to comprehensively evaluate these audio FMs on their ability to hold natural and interactive conversations. To engage in meaningful conversation with the end user, we would want the FMs to additionally perform a fluent succession of turns without too much overlapping speech or long stretches of silence. Inspired by this, we ask: can the recently proposed audio FMs understand, predict, and perform turn-taking events? To answer this, we propose a novel evaluation protocol that assesses a spoken dialogue system's turn-taking capabilities using a supervised model as a judge, trained to predict turn-taking events in human-human conversations. Using this protocol, we present the first comprehensive user study evaluating existing spoken dialogue systems on their ability to perform turn-taking events, revealing many interesting insights: they sometimes do not understand when to speak up, can interrupt too aggressively, and rarely backchannel. We further evaluate multiple open-source and proprietary audio FMs accessible through APIs on carefully curated test benchmarks from the Switchboard corpus to measure their ability to understand and predict turn-taking events, and identify significant room for improvement. We will open-source our evaluation platform to promote the development of advanced conversational AI systems.