🤖 AI Summary
To address the challenge of spatial speech understanding in multilingual, noisy environments, this paper proposes the first end-to-end spatial speech translation framework: real-time translation of multi-speaker audio into the wearer's native language on binaural hearables, while preserving the azimuthal location and unique voice characteristics (voiceprint) of every source. Methodologically, it integrates spatial awareness (directional cues and speaker-specific characteristics) directly into the speech translation pipeline, unifying blind source separation, sound source localization, expressive translation, and binaural rendering, all running with low latency on Apple M2 silicon. Unlike conventional translation systems, which discard spatial information, the approach performs joint spatial-linguistic modeling. Experiments yield a BLEU score of up to 22.01 despite strong real-world interference, and user studies confirm that the translated speech preserves spatial position and speaker identity, even in unseen reverberant conditions.
📝 Abstract
Imagine being in a crowded space where people speak a different language and having hearables that transform the auditory space into your native language, while preserving the spatial cues for all speakers. We introduce spatial speech translation, a novel concept for hearables that translate speakers in the wearer's environment, while maintaining the direction and unique voice characteristics of each speaker in the binaural output. To achieve this, we tackle several technical challenges spanning blind source separation, localization, real-time expressive translation, and binaural rendering to preserve the speaker directions in the translated audio, while achieving real-time inference on the Apple M2 silicon. Our proof-of-concept evaluation with a prototype binaural headset shows that, unlike existing models, which fail in the presence of interference, we achieve a BLEU score of up to 22.01 when translating between languages, despite strong interference from other speakers in the environment. User studies further confirm the system's effectiveness in spatially rendering the translated speech in previously unseen real-world reverberant environments. Taking a step back, this work marks the first step towards integrating spatial perception into speech translation.
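The binaural rendering stage described above (placing each translated voice back at its original direction) can be sketched with a toy interaural time/level difference model. This is a minimal illustration, not the paper's method: `render_binaural`, the head-radius constant, and the gain values are assumptions for the sketch; a real system would use HRTF-based rendering.

```python
import numpy as np

SPEED_OF_SOUND = 343.0   # m/s
HEAD_RADIUS = 0.0875     # m, rough spherical-head approximation (assumed value)

def render_binaural(mono: np.ndarray, azimuth_deg: float, sr: int = 16000) -> np.ndarray:
    """Place a mono (translated) signal at an azimuth with a toy ITD/ILD model.

    Positive azimuths are to the wearer's right. A crude stand-in for the
    HRTF-based binaural rendering a real hearable would use.
    """
    az = np.deg2rad(azimuth_deg)
    # Woodworth's interaural time difference for a spherical head.
    itd = (HEAD_RADIUS / SPEED_OF_SOUND) * (abs(az) + abs(np.sin(az)))
    delay = int(round(itd * sr))              # far-ear delay, in samples
    far_gain = 1.0 - 0.4 * abs(np.sin(az))   # crude interaural level difference
    near = np.concatenate([mono, np.zeros(delay)])
    far = far_gain * np.concatenate([np.zeros(delay), mono])
    if azimuth_deg >= 0:   # source on the right: left ear is the far ear
        left, right = far, near
    else:
        left, right = near, far
    return np.stack([left, right])   # shape (2, n_samples + delay)
```

In a full pipeline, each separated and translated stream would be rendered this way at its estimated azimuth and the stereo outputs summed, which is what lets the wearer keep attributing each translated voice to the correct speaker.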