Lend me an Ear: Speech Enhancement Using a Robotic Arm with a Microphone Array

📅 2026-02-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes an adaptive speech enhancement system that couples robotic motion control with audio signal processing to address the limitations of conventional methods in high-noise industrial environments. The system mounts a 16-microphone array on a seven-degree-of-freedom robotic arm and reconfigures the array geometry dynamically to improve speech capture from a target speaker. The pipeline combines sound source localization, computer vision, inverse kinematics, minimum variance distortionless response (MVDR) beamforming, and a deep time-frequency masking network. Its main novelty is the physically reconfigurable microphone array, which outperforms fixed-array configurations across several input signal-to-noise ratios. Experimental results show a higher scale-invariant signal-to-distortion ratio and a lower word error rate.
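Word error rate, one of the metrics mentioned in the summary above, is commonly computed as the word-level edit distance between a reference transcript and a recognizer's output, divided by the number of reference words. The following is a minimal sketch under that standard definition; the function name `word_error_rate` is hypothetical, and the text normalization is plain whitespace splitting, which may differ from what the authors used.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are separated by whitespace and compared exactly
    (no case folding or punctuation stripping).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return prev[-1] / len(ref)
```

For example, `word_error_rate("move the arm left", "move arm left")` counts one deleted word out of four reference words. Because insertions are counted, the value can exceed 1.0 when the hypothesis is much longer than the reference.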

📝 Abstract
Speech enhancement performance degrades significantly in noisy environments, limiting the deployment of speech-controlled technologies in industrial settings, such as manufacturing plants. Existing speech enhancement solutions primarily rely on advanced digital signal processing techniques, deep learning methods, or complex software optimization techniques. This paper introduces a novel enhancement strategy that incorporates a physical optimization stage by dynamically modifying the geometry of a microphone array to adapt to changing acoustic conditions. A sixteen-microphone array is mounted on a robotic arm manipulator with seven degrees of freedom, with microphones divided into four groups of four, including one group positioned near the end-effector. The system reconfigures the array by adjusting the manipulator joint angles to place the end-effector microphones closer to the target speaker, thereby improving the reference signal quality. The proposed method integrates sound source localization techniques, computer vision, inverse kinematics, a minimum variance distortionless response beamformer, and time-frequency masking using a deep neural network. Experimental results demonstrate that this approach outperforms other traditional recording configurations, achieving higher scale-invariant signal-to-distortion ratio and lower word error rate across multiple input signal-to-noise ratio conditions.
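The scale-invariant signal-to-distortion ratio (SI-SDR) reported in the abstract is commonly defined by projecting the enhanced signal onto the clean target and comparing the energy of that projection with the energy of the residual. The sketch below follows this common definition using only the standard library; the function name `si_sdr` and the mean-removal step are assumptions taken from typical implementations, not details confirmed by the paper.

```python
import math
from typing import Sequence


def si_sdr(estimate: Sequence[float], target: Sequence[float]) -> float:
    """Scale-invariant signal-to-distortion ratio in dB.

    Assumption: both signals are time-aligned and of equal length;
    means are removed first, as many implementations do.
    """
    if len(estimate) != len(target) or len(target) == 0:
        raise ValueError("signals must be non-empty and of equal length")

    n = len(target)
    mean_est = sum(estimate) / n
    mean_tgt = sum(target) / n
    est = [x - mean_est for x in estimate]
    tgt = [x - mean_tgt for x in target]

    target_energy = sum(t * t for t in tgt)
    if target_energy == 0.0:
        raise ValueError("target has zero energy after mean removal")

    # Optimal scaling of the target that best explains the estimate.
    alpha = sum(e * t for e, t in zip(est, tgt)) / target_energy
    projection = [alpha * t for t in tgt]
    residual = [e - p for e, p in zip(est, projection)]

    signal_energy = sum(p * p for p in projection)
    distortion_energy = sum(r * r for r in residual)
    if distortion_energy == 0.0:
        return math.inf   # estimate is an exact scaled copy of the target
    if signal_energy == 0.0:
        return -math.inf  # estimate is uncorrelated with the target
    return 10.0 * math.log10(signal_energy / distortion_energy)
```

Because the target is rescaled by `alpha` before the comparison, multiplying the estimate by any nonzero gain leaves the result unchanged, which is why the metric suits comparisons between recording configurations with different signal levels.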
Problem

Research questions and friction points this paper is trying to address.

speech enhancement
noisy environments
industrial settings
microphone array
acoustic conditions
Innovation

Methods, ideas, or system contributions that make the work stand out.

robotic microphone array
physical reconfiguration
speech enhancement
adaptive beamforming
deep neural network
Zachary Turcotte
Department of Electrical and Computer Engineering, Université de Sherbrooke, Québec, Canada
François Grondin
Associate Professor, Université de Sherbrooke
microphone array, distant speech recognition, robot audition, sound source localization, sound