When Robots Do the Chores: A Benchmark and Agent for Long-Horizon Household Task Execution

📅 2026-05-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing embodied AI benchmarks are largely confined to short-horizon navigation or manipulation tasks, making them inadequate for evaluating high-level planning and sustained reasoning in long-horizon household activities. To address this gap, this work introduces LongAct, a benchmark that treats such tasks as high-level cognitive challenges and evaluates them independently of low-level control. Building on this benchmark, the authors develop HoloMind, an agent that combines a DAG-based hierarchical planner, a multimodal spatial memory, an episodic memory for experience reuse, and a global Critic for reflective supervision. Experiments show that HoloMind substantially improves performance on long-horizon tasks while reducing reliance on model scale. Even state-of-the-art models achieve only a 59% goal-completion rate and a 16% full-task success rate on LongAct, underscoring the benchmark’s difficulty and the need for stronger long-horizon planning in embodied agents.
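The gap between the two reported numbers (59% goal completion versus 16% end-to-end, or full-task, success) is easier to read with a small worked example. The sketch below is illustrative only: it assumes goal completion is the mean fraction of goals satisfied per episode and full-task success is the fraction of episodes in which every goal is satisfied. The summary does not give the exact definitions used by LongAct, and the function names and episode data here are hypothetical.

```python
from typing import List


def goal_completion_rate(episodes: List[List[bool]]) -> float:
    """Mean fraction of satisfied goals per episode (assumed definition)."""
    per_episode = [sum(goals) / len(goals) for goals in episodes]
    return sum(per_episode) / len(per_episode)


def full_task_success_rate(episodes: List[List[bool]]) -> float:
    """Fraction of episodes in which every goal is satisfied (assumed definition)."""
    return sum(all(goals) for goals in episodes) / len(episodes)


# Hypothetical results: each inner list marks which goals of one task were met.
episodes = [
    [True, True, True, True],     # all goals met: counts as a full success
    [True, True, False, False],   # half of the goals met
    [True, False, False, False],  # a quarter of the goals met
    [True, True, True, False],    # one goal short of success
]

gc = goal_completion_rate(episodes)
fs = full_task_success_rate(episodes)
print(f"goal completion: {gc:.3f}, full-task success: {fs:.3f}")
```

Under these assumptions, an agent can satisfy most individual goals and still fail most tasks, because a single missed goal in a long task rules out full success.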

📝 Abstract

Long-horizon household tasks demand robust high-level planning and sustained reasoning capabilities, which are largely overlooked by existing embodied AI benchmarks that emphasize short-horizon navigation or manipulation and rely on fixed task categories. We introduce LongAct, a benchmark designed to evaluate planning-level autonomy in long-horizon household tasks specified through free-form instructions. By abstracting away embodiment-specific low-level control, LongAct isolates high-level cognitive capabilities such as instruction understanding, dependency management, memory maintenance, and adaptive planning. We further propose HoloMind, a VLM-driven agent with a DAG-based long-horizon hierarchical planner, a Multimodal Spatial Memory for persistent world modeling, an Episodic Memory for experience reuse, and a global Critic for reflective supervision. Experiments with GPT-5 and Qwen3-VL models show that HoloMind substantially improves long-horizon performance while reducing reliance on model scale. Even top models achieve only 59% goal completion and 16% full-task success, underscoring the difficulty of LongAct and the need for stronger long-horizon planning in embodied agents.
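To make the idea of a DAG-based plan concrete, the following minimal sketch orders subtasks by their dependencies using Python's standard `graphlib` module. It is not HoloMind's planner: the subtask names, the dependency graph, and the `execution_waves` helper are hypothetical, chosen only to show how a dependency graph constrains the order in which steps can run.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Hypothetical subtasks for "make tea": each key maps to its prerequisites.
subtasks: Dict[str, Set[str]] = {
    "find_mug": set(),
    "find_kettle": set(),
    "boil_water": {"find_kettle"},
    "pour_water": {"find_mug", "boil_water"},
    "serve_tea": {"pour_water"},
}


def execution_waves(graph: Dict[str, Set[str]]) -> List[List[str]]:
    """Group subtasks into waves; every subtask in a wave has all prerequisites done."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError if the plan has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted for a deterministic order
        waves.append(ready)
        sorter.done(*ready)  # mark the wave finished so dependents unlock
    return waves


waves = execution_waves(subtasks)
for i, wave in enumerate(waves, start=1):
    print(f"wave {i}: {wave}")
```

Representing the plan as a graph rather than a flat list keeps dependencies explicit, so independent subtasks (here, finding the mug and the kettle) can be reordered or replanned without invalidating the rest of the plan.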

Problem

Research questions and friction points this paper is trying to address.

long-horizon tasks

household task execution

embodied AI

high-level planning

sustained reasoning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Long-horizon planning

Embodied AI benchmark

Hierarchical task decomposition