AI Summary
Existing vision-language models often hallucinate objects, skip steps, and violate physical constraints in long-horizon embodied tasks, largely because their training data lacks fine-grained human reasoning and accurate spatial grounding. To address this, this work introduces a "say-before-act" protocol and builds EgoTL, an egocentric dataset in which spoken chains of thought carry word-level timestamps and are aligned with metric-scale spatial estimates, scene memory banks, and clip-level action labels. Using this resource, we establish a long-horizon evaluation benchmark covering over 100 daily household tasks, which shows that current foundation models still fall short as egocentric assistants and open-world simulators. Fine-tuning on human chains of thought aligned with metric labels improves long-horizon planning, step-wise reasoning, instruction following, and spatial grounding.
Abstract
Large foundation models have made significant advances in embodied intelligence, enabling synthesis and reasoning over egocentric input for household tasks. However, auto-labeling with vision-language models (VLMs) is often noisy because the primary data sources lack accurate human action labels, chain-of-thought (CoT), and spatial annotations; these errors are amplified during long-horizon spatial instruction following. These issues stem from insufficient coverage of minute-long, daily household planning tasks and from inaccurate spatial grounding. As a result, VLM reasoning chains and world-model synthesis can hallucinate objects, skip steps, or fail to respect real-world physical attributes. To address these gaps, we introduce EgoTL. EgoTL builds a think-aloud capture pipeline for egocentric data. It uses a say-before-act protocol to record step-by-step goals and spoken reasoning with word-level timestamps, then calibrates physical properties with metric-scale spatial estimators, a memory-bank walkthrough for scene context, and clip-level tags for navigation instructions and detailed manipulation actions. With EgoTL, we benchmark VLMs and world models on six task dimensions across three layers, and on long-horizon generation over minute-long sequences, covering more than 100 daily household tasks. We find that foundation models still fall short as egocentric assistants or open-world simulators. Finally, we fine-tune foundation models with human CoT aligned with metric labels on the training split of EgoTL, which improves long-horizon planning and reasoning, step-wise reasoning, instruction following, and spatial grounding.
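
To make the say-before-act alignment concrete, here is a minimal sketch of how word-level timestamps could be attached to clip-level action labels. The abstract does not specify EgoTL's actual schema or alignment rule, so everything here is an assumption made for illustration: the `Word` and `Clip` types, the function name `attach_reasoning`, and the rule that a word belongs to the earliest clip that has not yet ended when the word is spoken.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Word:
    """One spoken word with its timestamps in seconds (hypothetical schema)."""
    text: str
    start: float
    end: float


@dataclass
class Clip:
    """One clip-level action label with its time span in seconds (hypothetical schema)."""
    label: str
    start: float
    end: float


def attach_reasoning(words: List[Word], clips: List[Clip]) -> List[Tuple[str, str]]:
    """Pair each clip label with the reasoning spoken before or during that clip.

    Assumed rule: a word is attached to the earliest clip whose end time is
    at or after the word's start time. Words spoken after the last clip ends
    are dropped.
    """
    ordered_clips = sorted(clips, key=lambda c: c.start)
    buckets: List[List[str]] = [[] for _ in ordered_clips]
    for word in sorted(words, key=lambda w: w.start):
        for i, clip in enumerate(ordered_clips):
            if word.start <= clip.end:
                buckets[i].append(word.text)
                break
    return [(clip.label, " ".join(buckets[i])) for i, clip in enumerate(ordered_clips)]
```

Under this assumed rule, reasoning spoken in the gap before an action starts is grouped with that upcoming action, which matches the idea of stating the plan first and acting second. A real pipeline would likely need to handle speech that overlaps clip boundaries more carefully.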