🤖 AI Summary
This work addresses a limitation of existing object navigation benchmarks, which rely on explicit target categories and cannot represent human instructions that express implicit intentions (e.g., “the room feels stuffy”). To bridge this gap, we introduce IntentionNav, the first free-text intention-driven 3D object navigation benchmark. Agents must infer the target, actively search for it, and decide when the task is complete, all without being given the target's name. The benchmark uses a paired design across four instruction styles and four intention types, decoupling surface linguistic form from semantic cue type under identical scene geometry to enable fine-grained evaluation of target inference, language robustness, and end-task success. Built on Isaac Sim, the environment comprises 176 scenes, 64 object categories, and 500 intention-based instructions. Experiments with three VLMs show that the models identify the intended target in 48.3% of episodes and come within 2 meters of it in 68.7%, yet attain only a 24.9% end-task success rate, underscoring implicit intention understanding as a core challenge.
📝 Abstract
Existing object navigation benchmarks usually tell an embodied agent which object category to find, such as microwave or chair. Human-facing embodied AI is often asked something less direct: "I need something to warm this food" or "the room feels stuffy." The agent must infer the object that can satisfy the need, find a scene-grounded instance, and decide whether the goal has been reached. We study this setting as intent-driven object navigation and introduce IntentionNav, a diagnostic benchmark for active object search from implicit human instructions. Each episode provides a free-text intent, RGB-D observations, and pose, but withholds the target object name. IntentionNav contains 500 intents over 176 Isaac Sim scenes and 64 target categories. Each intent is rewritten in four controlled instruction styles and annotated with one of four intent modes, separating surface phrasing from semantic cue type under matched geometry. This paired design supports analysis of target inference, language robustness, neighborhood reachability, and terminal success rather than only aggregate success. We evaluate three VLMs using a fixed active-navigation agent. Models identify the intended target in 48.3 percent of episodes and enter its 2 m neighborhood in 68.7 percent, but terminate successfully in only 24.9 percent and achieve grounded 1 m success in 5.5 percent. Success is highest for event-script intents (28.7 percent) and lower for physical-state and affordance intents (19.2 percent and 18.5 percent), showing that indirect human intent remains a bottleneck for target selection, visual verification, and terminal localization in active embodied search.
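
To make the staged metrics concrete, the following is a minimal Python sketch of how per-episode outcomes could be aggregated into a funnel (target identified, 2 m neighborhood reached, stopped within 1 m) and broken down by intent mode. It is an illustration under stated assumptions, not the benchmark's evaluation code: the `EpisodeResult` record, its field names, the mode labels, and the exact success criteria are all hypothetical; only the 2 m and 1 m thresholds come from the abstract.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class EpisodeResult:
    # All field names are hypothetical, chosen for illustration only.
    intent_mode: str         # e.g. "event_script", "affordance" (assumed labels)
    target_identified: bool  # model inferred the intended target category
    min_dist_m: float        # closest approach to the target during the episode
    stopped: bool            # agent declared the task complete
    final_dist_m: float      # distance to the target at the final step


def rate(flags):
    """Fraction of True values; 0.0 for an empty sequence."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0


def funnel(episodes, near_m=2.0, success_m=1.0):
    """Stage-wise rates. The stopping criterion here is an assumption."""
    return {
        "identified": rate(e.target_identified for e in episodes),
        "reached_neighborhood": rate(e.min_dist_m <= near_m for e in episodes),
        "stopped_within_success_m": rate(
            e.stopped and e.final_dist_m <= success_m for e in episodes
        ),
    }


def funnel_by_mode(episodes):
    """Group episodes by intent mode and compute the funnel for each group."""
    groups = defaultdict(list)
    for e in episodes:
        groups[e.intent_mode].append(e)
    return {mode: funnel(eps) for mode, eps in groups.items()}
```

Reporting the stages separately, rather than a single success number, is what lets a gap such as "reaches the neighborhood but fails to stop correctly" show up in the results.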