The Talking Robot: Distortion-Robust Acoustic Models for Robot-Robot Communication

📅 2026-03-07
🤖 AI Summary
This work proposes a lightweight, end-to-end co-trained neural communication system for robot-to-robot speech interaction in acoustically challenging environments, where channel distortions such as noise severely degrade recognition accuracy. Because robot-to-robot communication does not need to preserve prosody or voice identity, the system can optimize purely for decoding accuracy. It pairs a 1.18M-parameter text-to-speech (TTS) transmitter with a 938K-parameter Conformer-based automatic speech recognition (ASR) receiver. A differentiable channel model lets the TTS module learn distortion-robust acoustic representations through a three-phase co-training curriculum, replacing conventional handcrafted signal processing. The reported character error rate (CER) is 8.3% at 0 dB SNR, with a total model size of only 2.1 million parameters (8.4 MB) and end-to-end CPU latency under 13 milliseconds.
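The size figures above can be checked with simple arithmetic. The sketch below assumes the weights are stored as 32-bit floats (4 bytes per parameter) and that 1 MB means 10^6 bytes; neither detail is stated on this page, and the helper name `footprint_mb` is made up for illustration.

```python
# Parameter counts quoted for the two networks.
TTS_PARAMS = 1_180_000   # TTS transmitter
ASR_PARAMS = 938_000     # Conformer-based ASR receiver

# Assumption: each weight is a 32-bit float (4 bytes).
BYTES_PER_PARAM = 4


def footprint_mb(n_params: int, bytes_per_param: int = BYTES_PER_PARAM) -> float:
    """Return weight storage in megabytes, taking 1 MB = 10**6 bytes."""
    return n_params * bytes_per_param / 1e6


total_params = TTS_PARAMS + ASR_PARAMS
print(f"total parameters: {total_params / 1e6:.3f}M")
print(f"float32 weights (unrounded sum): {footprint_mb(total_params):.3f} MB")
print(f"float32 weights (2.1M rounded):  {footprint_mb(2_100_000):.3f} MB")
```

Under these assumptions the quoted 8.4 MB corresponds to the rounded 2.1M parameter count; the unrounded sum of the two networks comes out slightly higher.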

📝 Abstract
We present Artoo, a learned acoustic communication system for robots that replaces hand-designed signal processing with end-to-end co-trained neural networks. Our system pairs a lightweight text-to-speech (TTS) transmitter (1.18M parameters) with a Conformer-based automatic speech recognition (ASR) receiver (938K parameters), jointly optimized through a differentiable channel. Unlike human speech, robot-to-robot communication is paralinguistics-free: the system need not preserve timbre, prosody, or naturalness; it only needs to maximize decoding accuracy under channel distortion. Through a three-phase co-training curriculum, the TTS transmitter learns to produce distortion-robust acoustic encodings that surpass the baseline under noise, achieving 8.3% CER at 0 dB SNR. The entire system requires only 2.1M parameters (8.4 MB) and runs in under 13 ms end-to-end on a CPU, making it suitable for deployment on resource-constrained robotic platforms.
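To make the headline metric concrete, the following is a minimal sketch of what "CER at 0 dB SNR" measures. It is not the authors' channel model or evaluation code: the additive white Gaussian noise channel, the function names, and the example command strings are all assumptions made for illustration, and only the forward pass of the channel is shown (no gradients).

```python
import math
import random


def add_noise_at_snr(signal, snr_db, rng):
    """Add white Gaussian noise scaled so the signal-to-noise ratio is snr_db."""
    signal_power = sum(x * x for x in signal) / len(signal)
    # SNR_dB = 10 * log10(P_signal / P_noise), solved for P_noise.
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in signal]


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (two-row dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance / reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


# At 0 dB SNR the noise carries as much power as the signal itself.
rng = random.Random(0)
tone = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(16000)]
noisy = add_noise_at_snr(tone, snr_db=0.0, rng=rng)

# Hypothetical transcript pair: one wrong character out of twelve.
print(f"CER: {cer('move forward', 'move forwerd'):.3f}")
```

For scale, a single substituted character in a 12-character message already gives a CER of 1/12, or about 8.3%.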
Problem

Research questions and friction points this paper is trying to address.

robot-robot communication
acoustic models
distortion robustness
speech recognition
text-to-speech
Innovation

Methods, ideas, or system contributions that make the work stand out.

distortion-robust
end-to-end co-training
robot-robot communication
lightweight TTS-ASR
differentiable channel
Hanlong Li
Institute of Science Tokyo
Karishma Kamalahasan
Georgia Institute of Technology
Jiahui Li
Institute of Science Tokyo
Kazuhiro Nakadai
Institute of Science Tokyo
Robot Audition and Scene Analysis, Artificial Intelligence, Signal and Speech Processing, Robotics
Shreyas Kousik
Georgia Institute of Technology
robotics