When Robots Do the Chores: A Benchmark and Agent for Long-Horizon Household Task Execution

📅 2026-05-14
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
Existing embodied AI benchmarks are largely confined to short-horizon navigation or manipulation tasks, making them inadequate for evaluating high-level planning and sustained reasoning in long-horizon household activities. To address this gap, this work introduces LongAct, a novel benchmark that reframes such tasks as high-level cognitive challenges and adopts an evaluation paradigm decoupled from low-level control. Building upon this framework, we develop HoloMind, an agent integrating a DAG-based hierarchical planner, multimodal spatial memory, episodic memory replay, and a global Critic for reflective self-supervision, enabling memory reuse and autonomous learning. Experiments demonstrate that HoloMind substantially improves performance on long-horizon tasks while reducing reliance on large model scale; notably, even state-of-the-art models achieve only a 59% goal-completion rate and a mere 16% end-to-end success rate on LongAct, underscoring the benchmark's difficulty and its value for advancing embodied AI research.
📝 Abstract
Long-horizon household tasks demand robust high-level planning and sustained reasoning capabilities, which are largely overlooked by existing embodied AI benchmarks that emphasize short-horizon navigation or manipulation and rely on fixed task categories. We introduce LongAct, a benchmark designed to evaluate planning-level autonomy in long-horizon household tasks specified through free-form instructions. By abstracting away embodiment-specific low-level control, LongAct isolates high-level cognitive capabilities such as instruction understanding, dependency management, memory maintenance, and adaptive planning. We further propose HoloMind, a VLM-driven agent with a DAG-based long-horizon hierarchical planner, a Multimodal Spatial Memory for persistent world modeling, an Episodic Memory for experience reuse, and a global Critic for reflective supervision. Experiments with GPT-5 and Qwen3-VL models show that HoloMind substantially improves long-horizon performance while reducing reliance on model scale. Even top models achieve only 59% goal completion and 16% full-task success, underscoring the difficulty of LongAct and the need for stronger long-horizon planning in embodied agents.
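The gap between goal completion and full-task success can be illustrated with a small sketch. The abstract does not give the exact metric definitions, so the code below assumes that goal completion is the mean fraction of sub-goals satisfied per episode and that full-task success requires every sub-goal in an episode to be satisfied. The function names and the toy episode data are hypothetical and are not taken from LongAct.

```python
from typing import List


def goal_completion_rate(episodes: List[List[bool]]) -> float:
    """Mean, over episodes, of the fraction of sub-goals satisfied (assumed definition)."""
    per_episode = [sum(goals) / len(goals) for goals in episodes if goals]
    return sum(per_episode) / len(per_episode)


def full_task_success_rate(episodes: List[List[bool]]) -> float:
    """Fraction of episodes in which every sub-goal is satisfied (assumed definition)."""
    solved = [all(goals) for goals in episodes if goals]
    return sum(solved) / len(solved)


# Toy data: each inner list marks which sub-goals of one episode were met.
episodes = [
    [True, True, True, True],
    [True, True, False, False],
    [True, False, False, False],
    [True, True, True, False],
]

print(goal_completion_rate(episodes))    # 0.625
print(full_task_success_rate(episodes))  # 0.25
```

Under these assumed definitions, a single missed sub-goal zeroes out an episode's full-task success while barely moving its goal completion, which is one plausible reason the two reported numbers sit so far apart on long-horizon tasks.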
Problem

Research questions and friction points this paper is trying to address.

long-horizon tasks
household task execution
embodied AI
high-level planning
sustained reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Long-horizon planning
Embodied AI benchmark
Hierarchical task decomposition
Multimodal spatial memory
Reflective supervision
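
As a rough illustration of hierarchical task decomposition over a DAG, the sketch below orders subtasks so that each one runs only after its prerequisites. It is not HoloMind's planner: the subtask names, the dependency graph, and the `plan_in_waves` helper are all hypothetical, and the ordering uses Python's standard `graphlib.TopologicalSorter` rather than any method described in the paper.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Hypothetical subtask graph: each key depends on the subtasks in its set.
subtasks: Dict[str, Set[str]] = {
    "find_mug": set(),
    "pick_up_mug": {"find_mug"},
    "go_to_sink": {"pick_up_mug"},
    "wash_mug": {"go_to_sink"},
    "find_coffee_machine": set(),
    "brew_coffee": {"wash_mug", "find_coffee_machine"},
}


def plan_in_waves(graph: Dict[str, Set[str]]) -> List[List[str]]:
    """Group subtasks into waves; every subtask in a wave has all prerequisites done."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError if the dependencies are cyclic
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only to make the output deterministic
        waves.append(ready)
        sorter.done(*ready)  # mark the wave complete so dependents become ready
    return waves


waves = plan_in_waves(subtasks)
for i, wave in enumerate(waves):
    print(i, wave)
```

Representing the plan as a DAG rather than a flat list makes independent subtasks explicit (here, locating the mug and locating the coffee machine), so a planner could reorder or retry them without invalidating the rest of the plan.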