🤖 AI Summary
This study addresses single-channel speaker distance estimation, which is highly sensitive to room acoustics and relies on acoustic cues that are poorly understood. The authors decompose simulated room impulse responses into direct sound, early reflections, and late reverberation, using the mixing time estimated from the echo density function as the boundary between early reflections and late reverberation. They then evaluate four scenarios that differ in time and amplitude calibration to assess how much each component contributes to distance estimation accuracy. Without time calibration, early reflections are the primary cue and the mean absolute error is 1.29 meters; accuracy improves as early energy increases and degrades under stronger reverberation. With time calibration, propagation delay alone is enough to reach an error of 0.14 meters.
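To illustrate why time calibration makes the task so much easier, the sketch below converts the arrival time of the direct sound into a distance. It is not the learned model from the study, only the underlying physical relation. The function name `distance_from_delay`, the speed of sound of 343 m/s, and the choice of the strongest tap as the direct-path arrival are assumptions made for this example.

```python
import numpy as np

SPEED_OF_SOUND_M_S = 343.0  # assumed value for air at roughly 20 degrees Celsius


def distance_from_delay(rir, fs, c=SPEED_OF_SOUND_M_S):
    """Estimate source distance from the direct-path delay of a time-calibrated RIR.

    Assumes sample 0 is the emission instant (synchronised capture) and that
    the direct path is the strongest tap in the response.
    """
    rir = np.asarray(rir, dtype=float)
    onset = int(np.argmax(np.abs(rir)))  # sample index of the direct sound
    return c * onset / fs                # delay in seconds times speed of sound


# Toy response: direct sound arrives 10 ms after emission at fs = 16 kHz.
fs = 16000
toy_rir = np.zeros(1600)
toy_rir[160] = 1.0
toy_distance = distance_from_delay(toy_rir, fs)
```

When the onset is arbitrary, as in the uncalibrated scenarios, this delay is no longer observable, so an estimator would have to fall back on reverberation-based cues.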
📝 Abstract
Single-channel speaker distance estimation has recently achieved centimeter-level accuracy in simulated environments, yet it remains unclear which components of the room impulse response (RIR) the model exploits and how performance depends on the recording conditions. In this work, we decompose simulated RIRs into four variants (full, direct-only, no-late, and no-early) using the mixing time estimated from the echo density function as the boundary between early reflections and late reverberation. We define four calibration scenarios, from fully calibrated (synchronised capture, known source level) to fully uncalibrated (arbitrary onset, unknown level), and evaluate all combinations on a matched dataset. Results show that without time calibration, mean absolute error (MAE) increases to $1.29$ m and the model extracts reverberation-based cues, with early reflections emerging as the most informative component. Further analysis against DRR, $C_{50}$, and $T_{60}$ confirms that estimation accuracy improves with stronger early energy and degrades in highly reverberant environments. When time calibration is available, the model achieves an MAE of $0.14$ m by extracting the propagation delay alone, regardless of the RIR content.
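The four RIR variants can be sketched as below, assuming the mixing time is already known. This is a minimal illustration, not the authors' implementation: estimating the mixing time from the echo density function is not shown, and the names `decompose_rir` and `drr_db`, the 2.5 ms direct-sound window, the convention that the mixing time is measured from the direct-sound arrival, and the reading of "no-late" as direct plus early and "no-early" as direct plus late are all assumptions.

```python
import numpy as np


def decompose_rir(rir, fs, mixing_time_s, direct_window_s=0.0025):
    """Split an RIR into full, direct-only, no-late, and no-early variants.

    Assumptions: the direct path is the strongest tap, the direct sound lasts
    `direct_window_s` after that tap, and `mixing_time_s` is counted from it.
    """
    rir = np.asarray(rir, dtype=float)
    onset = int(np.argmax(np.abs(rir)))
    direct_end = min(len(rir), onset + int(round(direct_window_s * fs)) + 1)
    late_start = min(len(rir), max(direct_end, onset + int(round(mixing_time_s * fs))))

    direct = np.zeros_like(rir)
    early = np.zeros_like(rir)
    late = np.zeros_like(rir)
    direct[:direct_end] = rir[:direct_end]                     # up to end of direct sound
    early[direct_end:late_start] = rir[direct_end:late_start]  # early reflections
    late[late_start:] = rir[late_start:]                       # late reverberation

    return {
        "full": direct + early + late,
        "direct_only": direct,
        "no_late": direct + early,
        "no_early": direct + late,
    }


def drr_db(parts):
    """Direct-to-reverberant ratio in dB, computed from the decomposed parts."""
    direct_energy = np.sum(parts["direct_only"] ** 2)
    reverb_energy = np.sum((parts["full"] - parts["direct_only"]) ** 2)
    return 10.0 * np.log10(direct_energy / reverb_energy)


# Toy RIR at fs = 1 kHz: direct tap, one early reflection, one late tap.
toy = np.zeros(100)
toy[10], toy[20], toy[60] = 1.0, 0.5, 0.25
parts = decompose_rir(toy, fs=1000, mixing_time_s=0.03)
toy_drr = drr_db(parts)
```

Convolving the same dry speech with each variant would then yield matched inputs that differ only in which RIR component is present.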