🤖 AI Summary
This work addresses the challenges faced by humanoid agents in active visual search within 360° immersive environments, where high cognitive load and reliance on expensive trajectory-level annotations often hinder performance. The authors propose the *Imagining in 360°* framework, which employs a decoupled architecture separating exploration into two components: an Imaginator that infers the semantic spatial layout of both observed and unobserved regions in a single step to generate probabilistic spatial priors, and an Actor that conducts robust search based on these priors. Notably, the method eliminates the need for trajectory-level chain-of-thought annotations by leveraging multi-hypothesis sampling and semantic layout modeling to synthesize large-scale training data. Evaluated in real-world complex environments, the approach significantly improves search efficiency and success rates, yielding a high-quality dataset of over 1.96 million samples.
📝 Abstract
Humanoid Visual Search (HVS) requires agents to actively explore immersive 360$^\circ$ environments. While prior methods treat this as a monolithic task relying on cumulative, multi-turn Chain-of-Thought (CoT) reasoning, they impose heavy cognitive burdens and require expensive trajectory-level annotations. In this paper, we propose Imagining in 360$^\circ$, a novel framework that decouples the exploration process into a specialized Imaginator and an Actor. The Imaginator functions as a probabilistic predictor of spatial priors; instead of maintaining a cumulative reasoning chain, it infers the semantic layout of both observed and unobserved regions in a single step. By sampling multiple hypotheses within this semantic space, we provide the Actor with a distribution of effective spatial information, offering robust guidance that hedges against uncertainty during active search. This decoupled architecture significantly lowers data engineering costs by eliminating the need for full-trajectory CoT annotations, enabling the generation of over 1.96 million curated training samples. Extensive experiments demonstrate that explicitly modeling semantic spatial priors drastically improves search efficiency and success rates in complex, in-the-wild environments.