FOA Tokenizer: Low-bitrate Neural Codec for First Order Ambisonics with Spatial Consistency Loss

📅 2025-10-25
🤖 AI Summary
This work addresses low-bitrate neural coding of first-order Ambisonics (FOA) spatial audio. We propose the first discrete neural codec designed specifically for four-channel FOA signals. We extend the WavTokenizer architecture to four-channel input and introduce a spatial consistency loss that encourages the reconstructed signals to preserve directional cues. At 0.9 kbps (75 tokens per second for 24 kHz audio), the codec reconstructs FOA signals with mean angular errors of 13.76°, 3.96°, and 25.83° on simulated reverberant mixtures, non-reverberant clean speech, and FOA mixtures with real room impulse responses, respectively. The learned discrete latent representations also provide useful features for downstream sound event localization and detection on STARSS23 real recordings. To the best of our knowledge, this is the first discrete neural codec that reconstructs FOA audio with preserved directional cues at such a low bitrate.

📝 Abstract
Neural audio codecs have been widely studied for mono and stereo signals, but spatial audio remains largely unexplored. We present the first discrete neural spatial audio codec for first-order ambisonics (FOA). Building on the WavTokenizer architecture, we extend it to support four-channel FOA signals and introduce a novel spatial consistency loss to preserve directional cues in the reconstructed signals under a highly compressed representation. Our codec compresses 4-channel FOA audio at 24 kHz into 75 discrete tokens per second, corresponding to a bit rate of 0.9 kbps. Evaluations on simulated reverberant mixtures, non-reverberant clean speech, and FOA mixtures with real room impulse responses show accurate reconstruction, with mean angular errors of 13.76°, 3.96°, and 25.83°, respectively, across the three conditions. In addition, discrete latent representations derived from our codec provide useful features for downstream spatial audio tasks, as demonstrated on sound event localization and detection with STARSS23 real recordings.
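The bitrate figures in the abstract can be checked with a little arithmetic. The sketch below uses only the numbers stated above (24 kHz, 4 channels, 75 tokens per second, 0.9 kbps). The single-codebook reading and the 16-bit PCM baseline are assumptions made for illustration and are not stated in the abstract.

```python
# Figures taken from the abstract.
SAMPLE_RATE_HZ = 24_000
NUM_CHANNELS = 4
TOKENS_PER_SECOND = 75
BITRATE_BPS = 900  # 0.9 kbps

# Bits carried by each token at the stated rate.
bits_per_token = BITRATE_BPS / TOKENS_PER_SECOND

# Assumption: one token stream drawn from a single codebook,
# so the codebook would hold 2**bits_per_token entries.
implied_codebook_size = 2 ** round(bits_per_token)

# Audio samples (per channel) summarized by one token.
samples_per_token = SAMPLE_RATE_HZ // TOKENS_PER_SECOND

# Assumption: uncompressed baseline is 16-bit PCM.
PCM_BITS = 16
raw_bitrate_bps = SAMPLE_RATE_HZ * NUM_CHANNELS * PCM_BITS
compression_ratio = raw_bitrate_bps / BITRATE_BPS

print(f"bits per token:        {bits_per_token:.0f}")
print(f"implied codebook size: {implied_codebook_size}")
print(f"samples per token:     {samples_per_token} per channel")
print(f"raw PCM bitrate:       {raw_bitrate_bps / 1000:.0f} kbps")
print(f"compression ratio:     {compression_ratio:.0f}x")
```

Under these assumptions each token carries 12 bits and covers 320 samples in each of the four channels, which gives a sense of how aggressive the compression is.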
Problem

Research questions and friction points this paper is trying to address.

Developing a neural codec for first-order ambisonics spatial audio
Preserving directional cues with spatial consistency loss
Enabling low-bitrate compression and downstream task features
Innovation

Methods, ideas, or system contributions that make the work stand out.

Extends WavTokenizer to four-channel FOA signals
Introduces spatial consistency loss for directional cues
Compresses 4-channel, 24 kHz FOA audio to 0.9 kbps (75 tokens per second)
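To make the notion of an angular error between original and reconstructed FOA signals concrete, here is a minimal sketch. It is not the paper's spatial consistency loss or its evaluation protocol, neither of which is specified on this page. It assumes a single dominant source, a W, X, Y, Z channel order with unit gain on W, and a time-averaged pseudo-intensity vector as the direction estimate. The helper names `encode_plane_wave`, `foa_direction`, and `angular_error_deg` are hypothetical.

```python
import numpy as np


def encode_plane_wave(signal, azimuth_deg, elevation_deg):
    """Encode a mono signal as a (4, T) FOA plane wave.

    Assumption: channels ordered W, X, Y, Z with unit gain on W.
    Real FOA conventions differ in ordering and normalization.
    """
    az, el = np.radians(azimuth_deg), np.radians(elevation_deg)
    gains = np.array([
        1.0,                      # W: omnidirectional
        np.cos(az) * np.cos(el),  # X: front-back
        np.sin(az) * np.cos(el),  # Y: left-right
        np.sin(el),               # Z: up-down
    ])
    return gains[:, None] * signal[None, :]


def foa_direction(foa):
    """Unit direction estimate from the time-averaged pseudo-intensity vector."""
    w, xyz = foa[0], foa[1:4]
    intensity = (w * xyz).mean(axis=1)
    norm = np.linalg.norm(intensity)
    if norm < 1e-12:
        raise ValueError("no directional energy in the signal")
    return intensity / norm


def angular_error_deg(reference, estimate):
    """Angle in degrees between the direction estimates of two FOA signals."""
    cosine = np.dot(foa_direction(reference), foa_direction(estimate))
    return float(np.degrees(np.arccos(np.clip(cosine, -1.0, 1.0))))


# Toy check: the "reconstruction" is the same source shifted 10 degrees in azimuth.
rng = np.random.default_rng(0)
source = rng.standard_normal(24_000)
reference = encode_plane_wave(source, azimuth_deg=30.0, elevation_deg=0.0)
shifted = encode_plane_wave(source, azimuth_deg=40.0, elevation_deg=0.0)
error = angular_error_deg(reference, shifted)
print(f"angular error: {error:.2f} degrees")
```

In practice a direction estimate would usually be computed per time-frequency bin or per frame rather than over the whole clip, and a differentiable version of such a measure is one way a directional penalty could be built into training.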