AI Summary
Existing embodied AI benchmarks are largely confined to short-horizon navigation or manipulation tasks, making them inadequate for evaluating high-level planning and sustained reasoning in long-horizon household activities. To address this gap, this work introduces LongAct, a novel benchmark that reframes such tasks as high-level cognitive challenges and adopts an evaluation paradigm decoupled from low-level control. Building upon this framework, we develop HoloMind, an agent integrating a DAG-based hierarchical planner, multimodal spatial memory, episodic memory replay, and a global Critic for reflective self-supervision, enabling memory reuse and autonomous learning. Experiments demonstrate that HoloMind substantially improves performance on long-horizon tasks while reducing reliance on large model scale; notably, even state-of-the-art models achieve only a 59% goal-completion rate and a mere 16% end-to-end success rate on LongAct, underscoring the benchmark's difficulty and its value for advancing embodied AI research.
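To give a rough picture of what a DAG-based plan can look like, the sketch below orders a handful of subgoals so that every dependency is satisfied before the subgoal that needs it. It is an illustration only, not HoloMind's planner: the subgoal names, the `subgoal_deps` mapping, and the `plan_order` helper are all hypothetical, and the ordering uses Python's standard `graphlib` module.

```python
from graphlib import TopologicalSorter

# Hypothetical subgoals for a "make tea" instruction.
# Each key maps to the set of subgoals that must finish first.
subgoal_deps = {
    "find_kettle": set(),
    "fill_kettle": {"find_kettle"},
    "boil_water": {"fill_kettle"},
    "find_cup": set(),
    "pour_tea": {"boil_water", "find_cup"},
}

def plan_order(deps):
    """Return one execution order that respects every dependency edge."""
    return list(TopologicalSorter(deps).static_order())

order = plan_order(subgoal_deps)
print(order)
```

Independent branches such as `find_cup` may appear anywhere before `pour_tea`, which is one reason a graph can be a more natural plan representation than a fixed linear script.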
Abstract
Long-horizon household tasks demand robust high-level planning and sustained reasoning capabilities, which are largely overlooked by existing embodied AI benchmarks that emphasize short-horizon navigation or manipulation and rely on fixed task categories. We introduce LongAct, a benchmark designed to evaluate planning-level autonomy in long-horizon household tasks specified through free-form instructions. By abstracting away embodiment-specific low-level control, LongAct isolates high-level cognitive capabilities such as instruction understanding, dependency management, memory maintenance, and adaptive planning. We further propose HoloMind, a VLM-driven agent with a DAG-based long-horizon hierarchical planner, a Multimodal Spatial Memory for persistent world modeling, an Episodic Memory for experience reuse, and a global Critic for reflective supervision. Experiments with GPT-5 and Qwen3-VL models show that HoloMind substantially improves long-horizon performance while reducing reliance on model scale. Even top models achieve only 59% goal completion and 16% full-task success, underscoring the difficulty of LongAct and the need for stronger long-horizon planning in embodied agents.
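The gap between goal completion and full-task success is easier to see with a toy calculation. The sketch below assumes that goal completion is the fraction of individual goals satisfied, pooled over all episodes, and that full-task success requires every goal in an episode to be satisfied. These definitions, the function names, and the episode data are assumptions made for illustration and may differ from how LongAct actually scores agents.

```python
def goal_completion_rate(episodes):
    """Fraction of all goals, pooled across episodes, that were satisfied."""
    total = sum(len(ep) for ep in episodes)
    done = sum(sum(ep) for ep in episodes)
    return done / total

def full_task_success_rate(episodes):
    """Fraction of episodes in which every goal was satisfied."""
    return sum(all(ep) for ep in episodes) / len(episodes)

# Made-up results: one list of per-goal outcomes per episode.
episodes = [
    [True, True, True, True],
    [True, True, False, False],
    [True, False, True, False],
    [True, True, False, True],
]

print(goal_completion_rate(episodes))    # 0.6875
print(full_task_success_rate(episodes))  # 0.25
```

Under these assumed definitions, a single missed goal fails the whole episode, so full-task success can sit far below goal completion as tasks grow longer.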