AI Summary
Existing language model-based approaches to lossless audio compression are constrained to 8-bit audio and struggle to scale to high-fidelity 16- or 24-bit scenarios. This work proposes Trilobyte, a byte-level tokenization method that reduces the vocabulary size from exponential O(2^b) to constant O(1), enabling, for the first time, scalable autoregressive language modeling of full-resolution 24-bit audio. Operating directly on raw waveforms, the method is systematically evaluated across diverse audio domains, including music, speech, and bioacoustics. Experiments demonstrate that Trilobyte outperforms FLAC in compression efficiency on both 8-bit and 16-bit audio; while gains on 24-bit audio are more modest, they nonetheless reveal both the potential and current limitations of language models for high-fidelity audio compression.
Abstract
Autoregressive language models (LMs) trained on raw waveforms can be repurposed for lossless audio compression, but prior work is limited to 8-bit audio, leaving open whether such approaches work for practical settings (16/24-bit) and can compete with existing codecs. We benchmark LM-based compression on full-fidelity audio across diverse domains (music, speech, bioacoustics), sampling rates (16 kHz to 48 kHz), and bit depths (8, 16, 24-bit). Standard sample-level tokenization becomes intractable at higher bit depths due to vocabulary size (65K for 16-bit; 16.7M for 24-bit). We propose Trilobyte, a byte-level tokenization schema for full-resolution audio, improving vocabulary scaling from $O(2^{b})$ to $O(1)$ and enabling the first tractable 24-bit LM-based lossless compression. While LMs consistently outperform FLAC and yield state-of-the-art compression at 8-bit and 16-bit, we observe that compression gains become more modest as bit depth increases beyond 8-bit.
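To make the vocabulary-scaling argument concrete, the following is a minimal sketch of byte-level tokenization for fixed-bit-depth PCM: each b-bit sample is split into b/8 bytes, so the token vocabulary stays at 256 regardless of bit depth (O(1) instead of O(2^b)). The byte order and sign handling here are illustrative assumptions, not necessarily the paper's exact Trilobyte scheme.

```python
def tokenize(samples, bit_depth):
    """Map signed integer PCM samples to a flat sequence of byte tokens (0-255)."""
    n_bytes = bit_depth // 8
    tokens = []
    for s in samples:
        s &= (1 << bit_depth) - 1           # two's-complement wrap into unsigned range
        for i in reversed(range(n_bytes)):  # big-endian: most significant byte first
            tokens.append((s >> (8 * i)) & 0xFF)
    return tokens

def detokenize(tokens, bit_depth):
    """Invert tokenize(): rebuild signed integer samples from byte tokens."""
    n_bytes = bit_depth // 8
    samples = []
    for j in range(0, len(tokens), n_bytes):
        s = 0
        for t in tokens[j:j + n_bytes]:
            s = (s << 8) | t
        if s >= 1 << (bit_depth - 1):       # restore the sign bit
            s -= 1 << bit_depth
        samples.append(s)
    return samples
```

Under this scheme a 24-bit sample becomes three tokens, tripling sequence length relative to sample-level tokenization but shrinking the LM's softmax from 16.7M classes to 256.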