AI Summary
This work proposes a novel sound-triggered mobile manipulation paradigm that addresses the limitations of existing methods, which rely on predefined textual instructions and therefore struggle in dynamic environments requiring spontaneous interaction. By treating environmental sound events as implicit triggers, the agent can proactively respond to sounding objects without explicit commands. To enable this capability, we introduce Habitat-Echo, a simulation platform supporting acoustic rendering and physical interaction, along with an architecture that integrates a high-level task planner and a low-level policy model, augmented by multi-source sound separation. Experiments demonstrate that the agent effectively distinguishes primary from secondary sound sources in complex dual-source scenarios and executes manipulation tasks in the appropriate sequence, validating the robustness and generalization capacity of the proposed approach.
Abstract
Current mobile manipulation research predominantly follows an instruction-driven paradigm, where agents rely on predefined textual commands to execute tasks. However, this setting confines agents to a passive role, limiting their autonomy and their ability to react to dynamic environmental events. To address these limitations, we introduce sound-triggered mobile manipulation, where agents must actively perceive and interact with sound-emitting objects without explicit action instructions. To support these tasks, we develop Habitat-Echo, a data platform that integrates acoustic rendering with physical interaction. We further propose a baseline comprising a high-level task planner and low-level policy models. Extensive experiments show that this baseline empowers agents to actively detect and respond to auditory events, eliminating the need for case-by-case instructions. Notably, in the challenging dual-source scenario, the agent isolates the primary source from overlapping acoustic interference to execute the first interaction, then proceeds to manipulate the secondary object, verifying the robustness of the baseline.
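The control flow described above (separate overlapping sources, handle the primary source first, then the secondary one) can be sketched as a minimal loop. Everything here is a hypothetical illustration of the described pipeline: the class names, the energy-based ranking, and the planner's subtask strings are assumptions, not the paper's actual API.

```python
"""Hedged sketch of a sound-triggered mobile manipulation loop.
All names below are illustrative placeholders, not the real system."""

from dataclasses import dataclass


@dataclass
class SoundEvent:
    label: str      # e.g. "phone_ringing" (hypothetical label set)
    energy: float   # separated-source energy, used here to rank sources


def separate_sources(mixture: list[SoundEvent]) -> list[SoundEvent]:
    # Stand-in for multi-source sound separation: rank overlapping
    # sources so the dominant (primary) one is handled first.
    return sorted(mixture, key=lambda e: e.energy, reverse=True)


def plan_task(event: SoundEvent) -> list[str]:
    # Stand-in for the high-level task planner: map a sound event to a
    # subtask sequence for the low-level policy models to execute.
    return [f"navigate_to({event.label})", f"manipulate({event.label})"]


def run_episode(mixture: list[SoundEvent]) -> list[str]:
    # Sound events act as implicit triggers: no textual instruction is
    # given; the agent reacts to each separated source in turn.
    actions: list[str] = []
    for event in separate_sources(mixture):
        actions.extend(plan_task(event))
    return actions


if __name__ == "__main__":
    # Dual-source scenario: the higher-energy (primary) source is
    # interacted with first, then the secondary one.
    dual = [SoundEvent("faucet_running", 0.3), SoundEvent("phone_ringing", 0.9)]
    print(run_episode(dual))
```

The point of the sketch is only the ordering guarantee in the dual-source case; a real system would replace the energy heuristic with a learned separation model and the string subtasks with policy rollouts.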