🤖 AI Summary
This work addresses how household agents can turn natural-language requests into executable tasks in full-scale home environments, where complete scene descriptions contain substantial task-irrelevant information and where privacy and local compute constraints favor compact models with limited long-context reasoning ability. The study formalizes the problem of “full-scene household reasoning” and introduces TaskGround, a training-free, model-agnostic Ground-Infer-Execute framework. TaskGround grounds the complete scene into compact task-relevant scene slices, infers executable task structure, and compiles that structure into grounded skill-level action sequences. On the FullHome benchmark, the approach substantially improves the task success rates of small local models: with TaskGround, Qwen3.5-9B becomes competitive with GPT-5 under direct complete-scene prompting, while total input-token cost falls by up to 18×.
📝 Abstract
In real home deployments, household agents must often operate from a complete household scene and a situated household request, rather than from a clean task specification. Such requests require agents to identify task-relevant entities, recover intended task conditions, and resolve ordering constraints from the surrounding scene context. We formalize this capability as full-scene household reasoning: given a complete household scene and a situated household request, an agent must infer executable task structure before producing a grounded skill-level action sequence. This setting is challenging because complete household scenes contain substantial task-irrelevant information, making direct complete-scene prompting inefficient and error-prone. In practical deployment, this challenge is further amplified by privacy and local compute constraints, which favor compact open-weight models with limited long-context reasoning ability. We propose TaskGround, a training-free and model-agnostic Ground-Infer-Execute framework that grounds complete scenes into compact task-relevant scene slices, infers executable task structure, and compiles it into grounded skill-level action sequences. To evaluate this setting, we introduce FullHome, a human-validated evaluation suite of 400 household tasks spanning diverse home-scale environments and both goal-oriented and process-constrained requirements. On FullHome, TaskGround improves task success rates by large margins across both proprietary and open-weight models. Notably, it makes Qwen3.5-9B competitive with GPT-5 under direct complete-scene prompting while reducing total input-token cost by up to 18x. Our results identify executable task-structure inference as a central bottleneck in full-scene household reasoning and show that structured grounding can make compact local models substantially more effective for practical household deployment.
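The three stages named in the abstract can be illustrated with a toy sketch. It is not the authors' implementation. Every name in it (`SceneObject`, `ground`, `infer`, `execute`, `task_ground`, the `walk_to` skill) is hypothetical. Keyword matching and a "then"-splitting rule stand in for the grounding and task-structure inference that a language model would presumably perform in TaskGround.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter


@dataclass(frozen=True)
class SceneObject:
    name: str
    room: str


def ground(scene, request):
    """Ground: keep only the scene objects the request mentions (toy keyword match)."""
    words = set(request.lower().split())
    return [obj for obj in scene if obj.name in words]


def infer(scene_slice, request):
    """Infer: recover steps and ordering constraints (toy rule, not an LLM)."""
    steps = []
    for clause in request.lower().split(" then "):
        tokens = clause.split()
        target = next((o for o in scene_slice if o.name in tokens), None)
        if tokens and target is not None:
            steps.append((tokens[0], target))  # (verb, object)
    # Each step depends on the one before it, mirroring the word "then".
    deps = {i: ({i - 1} if i else set()) for i in range(len(steps))}
    return steps, deps


def execute(steps, deps):
    """Execute: compile the task structure into skill-level actions."""
    plan = []
    for i in TopologicalSorter(deps).static_order():
        verb, obj = steps[i]
        plan.append(("walk_to", obj.room))  # hypothetical navigation skill
        plan.append((verb, obj.name))
    return plan


def task_ground(scene, request):
    scene_slice = ground(scene, request)
    steps, deps = infer(scene_slice, request)
    return execute(steps, deps)


# Hypothetical four-object scene; only two objects are relevant to the request.
scene = [
    SceneObject("mug", "kitchen"),
    SceneObject("plate", "dining_room"),
    SceneObject("sofa", "living_room"),
    SceneObject("lamp", "bedroom"),
]
plan = task_ground(scene, "wash the mug then store the plate")
print(plan)
```

In this sketch the sofa and the lamp never reach the inference step, which is the intuition behind the reduction in input tokens: the later stages see only the scene slice rather than the complete scene.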