Beyond Transcripts: A Renewed Perspective on Audio Chaptering

πŸ“… 2026-02-09
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses limitations of existing audio segmentation approaches, which rely heavily on textual transcriptions and leave open questions about intrinsic audio signals, the impact of ASR errors, and transcription-free evaluation. To close these gaps, the authors propose AudioSeg, a purely audio-based segmentation model, and conduct a systematic comparison among text-based models with acoustic features, AudioSeg, and multimodal large language models (MLLMs). They further introduce a transcription-free evaluation framework based on temporal alignment. Experimental results show that AudioSeg significantly outperforms text-dependent methods and that silent pauses are the strongest acoustic cue. Although MLLMs are constrained by context length, they show promise on short audio segments. The study establishes the first transcription-independent evaluation benchmark for audio segmentation and analyzes the interplay among transcription quality, acoustic properties, and model performance.
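
The summary highlights silent pauses as the strongest acoustic cue. As a rough, hypothetical illustration (not the paper's implementation), the sketch below detects long silent pauses in an audio file with librosa and returns them as candidate chapter-boundary cues; the function name and the thresholds (top_db, min_pause_s) are assumptions made for this example.

```python
import librosa

def pause_candidates(path, sr=16000, top_db=30, min_pause_s=1.0):
    """Return (start_s, end_s) spans of silence longer than min_pause_s seconds."""
    y, sr = librosa.load(path, sr=sr, mono=True)
    # Non-silent intervals (in samples), detected relative to the signal's peak level.
    voiced = librosa.effects.split(y, top_db=top_db)
    pauses = []
    for (_, prev_end), (next_start, _) in zip(voiced[:-1], voiced[1:]):
        gap_s = (next_start - prev_end) / sr
        if gap_s >= min_pause_s:
            pauses.append((prev_end / sr, next_start / sr))
    return pauses  # candidate chapter-boundary locations, in seconds
```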

πŸ“ Abstract
Audio chaptering, the task of automatically segmenting long-form audio into coherent sections, is increasingly important for navigating podcasts, lectures, and videos. Despite its relevance, research remains limited and text-based, leaving key questions unresolved about leveraging audio information, handling ASR errors, and transcript-free evaluation. We address these gaps through three contributions: (1) a systematic comparison of text-based models with acoustic features, a novel audio-only architecture (AudioSeg) operating on learned audio representations, and multimodal LLMs; (2) empirical analysis of factors affecting performance, including transcript quality, acoustic features, duration, and speaker composition; and (3) formalized evaluation protocols contrasting transcript-dependent text-space protocols with transcript-invariant time-space protocols. Our experiments on YTSeg reveal that AudioSeg substantially outperforms text-based approaches, that pauses provide the largest acoustic gains, and that MLLMs remain limited by context length and weak instruction following, yet show promise on shorter audio.
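
The abstract contrasts transcript-dependent text-space protocols with transcript-invariant time-space protocols. Below is a minimal sketch of what a time-space protocol could look like, assuming predicted and reference chapter boundaries are timestamps in seconds and that a prediction counts as correct when it falls within a fixed tolerance window of an unmatched reference boundary; the tolerance value and the greedy one-to-one matching are illustrative assumptions, not the paper's exact protocol.

```python
def boundary_f1(pred, ref, tol_s=5.0):
    """Precision/recall/F1 over chapter boundaries given in seconds."""
    pred, ref = sorted(pred), sorted(ref)
    matched_ref = set()
    tp = 0
    for p in pred:
        # Greedily match each prediction to the nearest unmatched reference
        # boundary that lies within the tolerance window.
        best, best_d = None, tol_s
        for i, r in enumerate(ref):
            d = abs(p - r)
            if i not in matched_ref and d <= best_d:
                best, best_d = i, d
        if best is not None:
            matched_ref.add(best)
            tp += 1
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1
```

Because such a score is computed purely from timestamps, errors in any intermediate transcript never enter the evaluation.
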
Problem

Research questions and friction points this paper is trying to address.

audio chaptering
ASR errors
transcript-free evaluation
acoustic features
long-form audio segmentation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Audio Chaptering
AudioSeg
Acoustic Features
Transcript-Free Evaluation
Multimodal LLMs
πŸ”Ž Similar Papers
No similar papers found.