Beyond Fixed Frames: Dynamic Character-Aligned Speech Tokenization

📅 2026-01-30
🤖 AI Summary
This work proposes DyCAST, a dynamic character-aligned speech tokenizer that addresses the inefficiency of existing neural audio codecs, which rely on fixed frame rates and produce unnecessarily long token sequences. DyCAST combines soft character-level alignment with explicit duration modeling to enable variable-frame-rate tokenization. Alignment is learned during training, so inference requires no alignment, and token durations can be controlled directly at decoding time. A retrieval-augmented decoding mechanism further improves reconstruction fidelity at low frame rates without increasing bitrate. Experiments show that DyCAST uses significantly fewer tokens than fixed-frame-rate codecs while achieving competitive speech resynthesis quality and downstream performance.

📝 Abstract
Neural audio codecs are at the core of modern conversational speech technologies, converting continuous speech into sequences of discrete tokens that can be processed by LLMs. However, existing codecs typically operate at fixed frame rates, allocating tokens uniformly in time and producing unnecessarily long sequences. In this work, we introduce DyCAST, a Dynamic Character-Aligned Speech Tokenizer that enables variable-frame-rate tokenization through soft character-level alignment and explicit duration modeling. DyCAST learns to associate tokens with character-level linguistic units during training and supports alignment-free inference with direct control over token durations at decoding time. To improve speech resynthesis quality at low frame rates, we further introduce a retrieval-augmented decoding mechanism that enhances reconstruction fidelity without increasing bitrate. Experiments show that DyCAST achieves competitive speech resynthesis quality and downstream performance while using significantly fewer tokens than fixed-frame-rate codecs. Code and checkpoints will be released publicly at https://github.com/lucadellalib/dycast.
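
The idea of variable-frame-rate tokenization with explicit durations can be illustrated with a small sketch. This is not the DyCAST implementation: it assumes hard integer durations (the paper describes soft alignment), uses plain mean pooling, and the function names `pool_by_durations` and `expand_by_durations` are hypothetical. It only shows how fixed-rate frames might be collapsed into one vector per character-like segment, and how tokens might be stretched back to frame rate from their durations at decoding time.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def pool_by_durations(
    frames: Sequence[Sequence[float]], durations: Sequence[int]
) -> List[List[float]]:
    """Average consecutive frames into one vector per segment.

    Assumption: durations are hard integer frame counts, one per segment.
    """
    if any(d <= 0 for d in durations):
        raise ValueError("every duration must be a positive integer")
    if sum(durations) != len(frames):
        raise ValueError("durations must sum to the number of frames")

    pooled: List[List[float]] = []
    start = 0
    for d in durations:
        segment = frames[start:start + d]
        dim = len(segment[0])
        # Mean over the frames that belong to this segment.
        pooled.append([sum(f[i] for f in segment) / d for i in range(dim)])
        start += d
    return pooled


def expand_by_durations(tokens: Sequence[T], durations: Sequence[int]) -> List[T]:
    """Repeat each token for its duration to get back a frame-rate sequence."""
    if len(tokens) != len(durations):
        raise ValueError("need exactly one duration per token")
    expanded: List[T] = []
    for token, d in zip(tokens, durations):
        expanded.extend([token] * d)
    return expanded


# Six fixed-rate frames (2-dimensional) grouped into three segments.
frames = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0], [6.0, 7.0], [8.0, 9.0], [10.0, 11.0]]
durations = [1, 3, 2]

pooled = pool_by_durations(frames, durations)           # 3 vectors instead of 6
expanded = expand_by_durations(["a", "b", "c"], durations)  # back to 6 frames
```

In this toy case the sequence shrinks from six frames to three segment vectors, and changing the entries of `durations` before expansion is one simple way to picture duration control at decoding time.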

Problem

Research questions and friction points this paper is trying to address.

neural audio codecs
fixed frame rates
speech tokenization
discrete tokens
character-level alignment

Innovation

Methods, ideas, or system contributions that make the work stand out.

variable-frame-rate tokenization
character-aligned speech
duration modeling
retrieval-augmented decoding
neural audio codec