🤖 AI Summary
This work proposes a lightweight, end-to-end co-trained neural communication system for robot-to-robot speech interaction in acoustically challenging environments, where channel distortions such as noise severely degrade recognition accuracy. Designed for scenarios that do not require prosody or voice identity, the system pairs a 1.18M-parameter text-to-speech (TTS) transmitter with a 938K-parameter Conformer-based automatic speech recognition (ASR) receiver. A differentiable channel model lets the TTS module learn distortion-robust acoustic representations through a three-phase co-training curriculum, removing the reliance on conventional handcrafted signal processing. Experimental results show a character error rate (CER) of 8.3% at 0 dB SNR, with a total model size of only 2.1 million parameters (8.4 MB) and end-to-end CPU latency under 13 milliseconds, combining robustness to noise with a small computational footprint.
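The size figures above can be sanity-checked with a few lines of arithmetic. This is a minimal sketch that assumes every parameter is stored as a 32-bit float (4 bytes), which the text does not state; the constant names are illustrative only.

```python
# Parameter counts as quoted in the summary.
TTS_PARAMS = 1_180_000   # TTS transmitter
ASR_PARAMS = 938_000     # Conformer-based ASR receiver

# Assumption: float32 weights, 4 bytes each (not stated in the source).
BYTES_PER_PARAM = 4

total_params = TTS_PARAMS + ASR_PARAMS
size_mb = total_params * BYTES_PER_PARAM / 1e6  # decimal megabytes

print(f"total parameters: {total_params / 1e6:.2f}M")
print(f"approximate size: {size_mb:.2f} MB")
```

Under that assumption the total comes to about 2.12M parameters, and the quoted 8.4 MB corresponds to the rounded 2.1M figure multiplied by 4 bytes.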
📝 Abstract
We present Artoo, a learned acoustic communication system for robots that replaces hand-designed signal processing with end-to-end co-trained neural networks. Our system pairs a lightweight text-to-speech (TTS) transmitter (1.18M parameters) with a Conformer-based automatic speech recognition (ASR) receiver (938K parameters), jointly optimized through a differentiable channel. Unlike human speech, robot-to-robot communication is paralinguistics-free: the system need not preserve timbre, prosody, or naturalness, only maximize decoding accuracy under channel distortion. Through a three-phase co-training curriculum, the TTS transmitter learns to produce distortion-robust acoustic encodings that surpass the baseline under noise, achieving 8.3% CER at 0 dB SNR. The entire system requires only 2.1M parameters (8.4 MB) and runs in under 13 ms end-to-end on a CPU, making it suitable for deployment on resource-constrained robotic platforms.
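To make the "0 dB SNR" and "CER" terms concrete, the following is a minimal sketch, not the authors' channel model or evaluation code. It assumes the distortion is additive white Gaussian noise (the abstract does not specify the channel) and that CER is the character-level edit distance divided by the reference length. The function names `add_noise_at_snr` and `character_error_rate` are hypothetical. Only the forward computation is shown; in an autodiff framework the same addition would pass gradients back to the transmitter.

```python
import math
import random


def add_noise_at_snr(signal, snr_db, rng):
    """Add white Gaussian noise scaled so the output has the requested SNR (dB).

    Assumption: an additive-noise channel; the real channel model may differ.
    """
    signal_power = sum(x * x for x in signal) / len(signal)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in signal]


def character_error_rate(reference, hypothesis):
    """Character-level Levenshtein distance divided by the reference length."""
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / max(len(reference), 1)


# One second of a 440 Hz tone at 16 kHz stands in for the transmitted audio.
rng = random.Random(0)
clean = [math.sin(2 * math.pi * 440.0 * n / 16000.0) for n in range(16000)]
noisy = add_noise_at_snr(clean, snr_db=0.0, rng=rng)

# Measure the SNR that was actually realized.
clean_power = sum(x * x for x in clean) / len(clean)
noise_power = sum((y - x) ** 2 for x, y in zip(clean, noisy)) / len(clean)
measured_snr_db = 10.0 * math.log10(clean_power / noise_power)

# One substituted character out of twelve.
cer = character_error_rate("move forward", "move forwerd")
```

At 0 dB the noise carries as much power as the signal itself, so an 8.3% CER means roughly one character in twelve is decoded incorrectly under that condition.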