HELIOS: Hierarchical Exploration for Language-grounded Interaction in Open Scenes

📅 2025-09-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing approaches to language-guided mobile manipulation in open-world settings struggle to align semantics with perception under partial observability, to update scene knowledge as new observations arrive, and to generalize across domains. This paper proposes HELIOS, a hierarchical scene representation and active-search framework: it combines 2D semantic occupancy maps with 3D Gaussian object representations to build a multi-view-consistent, joint semantic-geometric scene model, and introduces a search objective that balances exploration and exploitation, enabling language-conditioned active scene reasoning and incremental knowledge refinement. Evaluated in the Habitat simulator, the method performs zero-shot, language-guided pick-and-place manipulation, attains state-of-the-art performance on the Open-Vocabulary Mobile Manipulation (OVMM) benchmark, and transfers to a real-world office environment, where it is validated on a Spot robot.

📝 Abstract
Language-specified mobile manipulation tasks in novel environments simultaneously face the challenges of interacting with a scene that is only partially observed, grounding semantic information from language instructions to that partially observed scene, and actively updating knowledge of the scene with new observations. To address these challenges, we propose HELIOS, a hierarchical scene representation and associated search objective for performing language-specified pick-and-place mobile manipulation tasks. We construct 2D maps containing the relevant semantic and occupancy information for navigation while simultaneously and actively constructing 3D Gaussian representations of task-relevant objects. We fuse observations across this multi-layered representation while explicitly modeling the multi-view consistency of the detections of each object. To search for the target object efficiently, we formulate an objective function balancing exploration of unobserved or uncertain regions with exploitation of scene semantic information. We evaluate HELIOS on the OVMM benchmark in the Habitat simulator, a pick-and-place benchmark in which perception is challenging due to large, complex scenes with comparatively small target objects. HELIOS achieves state-of-the-art results on OVMM. Because our approach is zero-shot, HELIOS also transfers to the real world without requiring additional data, as we demonstrate in a real-world office environment on a Spot robot.
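The abstract describes a search objective that balances exploration of unobserved or uncertain regions against exploitation of scene semantic information. A minimal sketch of that idea as a frontier-scoring function; the specific terms, the linear weighting, and all names here are illustrative assumptions, not the paper's actual formulation:

```python
import numpy as np

def search_objective(frontiers, uncertainty, semantic_score, alpha=0.5):
    """Pick the candidate region that best trades off exploration and exploitation.

    frontiers:      (N, 2) array of candidate map cells to visit
    uncertainty:    (N,) how unobserved/uncertain each region is, in [0, 1]
    semantic_score: (N,) language-conditioned relevance of each region, in [0, 1]
    alpha:          exploration weight (an assumed hyperparameter, not from the paper)
    """
    # Linear blend: high alpha favors uncertain (unexplored) regions,
    # low alpha favors regions the language grounding deems relevant.
    scores = alpha * uncertainty + (1.0 - alpha) * semantic_score
    return frontiers[int(np.argmax(scores))]

frontiers = np.array([[0, 0], [1, 1], [2, 2]])
uncertainty = np.array([0.9, 0.1, 0.5])
semantic = np.array([0.0, 0.9, 0.6])
best = search_objective(frontiers, uncertainty, semantic, alpha=0.5)
# scores are [0.45, 0.5, 0.55], so the third frontier wins: [2, 2]
```

In practice such an objective would be re-evaluated after every observation, since both the uncertainty map and the semantic scores change as the scene representation is refined.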
Problem

Research questions and friction points this paper is trying to address.

Addresses mobile manipulation in partially observed scenes
Grounds semantic information from language instructions
Balances exploration and exploitation for object search
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical scene representation for mobile manipulation
Multi-layered 2D-3D fusion with multi-view consistency
Balanced exploration-exploitation search objective function
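The fusion contribution above models how consistently each object is detected across viewpoints. A toy sketch of that consistency-tracking idea; the class, its fields, and the frequency-based confidence are illustrative assumptions standing in for the paper's actual model:

```python
from dataclasses import dataclass, field

@dataclass
class ObjectHypothesis:
    """Track how consistently an object is re-detected across viewpoints.

    Simplified stand-in for multi-view consistency modeling: record a
    detection outcome per view and derive confidence from the fraction
    of views that re-detected the object.
    """
    label: str
    detections: list = field(default_factory=list)  # True/False per view

    def add_view(self, detected: bool) -> None:
        self.detections.append(detected)

    @property
    def confidence(self) -> float:
        if not self.detections:
            return 0.0
        return sum(self.detections) / len(self.detections)

hyp = ObjectHypothesis("mug")
for seen in (True, True, False, True):
    hyp.add_view(seen)
# detected in 3 of 4 views -> confidence 0.75
```

A real system would additionally weight views by geometric overlap and detector score, but the core idea of down-weighting objects that fail to reappear from new viewpoints carries over.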