🤖 AI Summary
This work proposes DyCAST, a dynamic character-aligned speech tokenizer that addresses the inefficiency of existing neural audio codecs, which rely on fixed frame rates and produce excessively long token sequences. DyCAST is the first to integrate soft character-level alignment with explicit duration modeling, enabling variable frame-rate tokenization. The method supports alignment-free inference and allows direct control over token durations during decoding. Additionally, it incorporates retrieval-augmented decoding to enhance reconstruction fidelity. Experimental results demonstrate that DyCAST substantially reduces the number of tokens while maintaining high-quality speech reconstruction and competitive performance on downstream tasks.
📝 Abstract
Neural audio codecs are at the core of modern conversational speech technologies, converting continuous speech into sequences of discrete tokens that can be processed by LLMs. However, existing codecs typically operate at fixed frame rates, allocating tokens uniformly in time and producing unnecessarily long sequences. In this work, we introduce DyCAST, a Dynamic Character-Aligned Speech Tokenizer that enables variable-frame-rate tokenization through soft character-level alignment and explicit duration modeling. DyCAST learns to associate tokens with character-level linguistic units during training and supports alignment-free inference with direct control over token durations at decoding time. To improve speech resynthesis quality at low frame rates, we further introduce a retrieval-augmented decoding mechanism that enhances reconstruction fidelity without increasing bitrate. Experiments show that DyCAST achieves competitive speech resynthesis quality and downstream performance while using significantly fewer tokens than fixed-frame-rate codecs. Code and checkpoints will be released publicly at https://github.com/lucadellalib/dycast.
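The idea of decoding with explicit, controllable token durations can be illustrated with a minimal sketch. This is a hypothetical toy example, not DyCAST's actual implementation: it assumes each discrete token carries a duration (in decoder frames) and simply expands the variable-rate sequence back to a fixed-rate frame sequence, so that scaling the durations directly changes the length (and hence the tempo) of the resynthesized speech.

```python
# Hypothetical sketch of duration-controlled decoding for a
# variable-frame-rate tokenizer. Token ids and durations are illustrative;
# a real codec would feed the expanded frames to a neural decoder.

def expand_tokens(tokens, durations):
    """Repeat each token id by its duration (in decoder frames)."""
    assert len(tokens) == len(durations)
    frames = []
    for tok, dur in zip(tokens, durations):
        frames.extend([tok] * dur)
    return frames

# Three tokens covering nine decoder frames: the 3-token sequence is much
# shorter than the 9 frames a fixed-frame-rate codec would emit.
tokens = [17, 4, 29]
durations = [2, 4, 3]
print(expand_tokens(tokens, durations))  # [17, 17, 4, 4, 4, 4, 29, 29, 29]

# Direct duration control at decoding time: doubling every duration
# doubles the frame count, slowing the synthesized speech.
print(expand_tokens(tokens, [2 * d for d in durations]))
```

Under this view, the tokenizer's job is to learn durations that track character-level linguistic units, so that tokens are spent where the speech content is, rather than uniformly in time.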