🤖 AI Summary
This work addresses the challenge of enabling humanoid robots to jointly perceive, navigate, and manipulate in partially observable, dynamically changing real-world environments while transitioning robustly across heterogeneous subtasks. To this end, the paper introduces EgoActing, a novel task formulation, and presents EgoActor, a unified and scalable vision-language model (VLM) that directly maps high-level instructions to embodied, spatially aware action sequences, thereby bridging abstract planning and concrete execution. The model is trained with multi-source supervision that combines egocentric RGB-only real-world demonstrations, spatial reasoning question answering, and simulation-based demonstrations, and it runs inference in under 1 second at both 8B and 4B parameter scales. Experiments in both simulation and real-world settings show that the approach generalizes across diverse tasks and unseen environments, enhancing execution fluency and robustness.
📝 Abstract
Deploying humanoid robots in real-world settings is fundamentally challenging: it demands tight integration of perception, locomotion, and manipulation under partial observability and in dynamically changing environments, as well as robust transitions between sub-tasks of different types. To address these challenges, we propose a novel task, EgoActing, which requires directly grounding high-level instructions into varied, precise, spatially aware humanoid actions. We further instantiate this task by introducing EgoActor, a unified and scalable vision-language model (VLM) that predicts locomotion primitives (e.g., walk, turn, move sideways, change height), head movements, manipulation commands, and human-robot interactions to coordinate perception and execution in real time. We leverage broad supervision over egocentric RGB-only data from real-world demonstrations, spatial reasoning question-answering, and simulated environment demonstrations, enabling EgoActor to make robust, context-aware decisions and perform fluent action inference (under 1s) with both 8B and 4B parameter models. Extensive evaluations in both simulated and real-world environments demonstrate that EgoActor effectively bridges abstract task planning and concrete motor execution, while generalizing across diverse tasks and unseen environments.
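As a reading aid, the four action families named in the abstract (locomotion primitives, head movements, manipulation commands, human-robot interactions) can be pictured as a small typed action vocabulary. The sketch below is only an illustration under stated assumptions: the `kind:name[:argument]` text format, the class and function names, and the snake_case primitive spellings are all hypothetical and are not taken from the paper, which does not specify EgoActor's output format here.

```python
from dataclasses import dataclass
from enum import Enum


class ActionKind(Enum):
    # The four action families named in the abstract.
    LOCOMOTION = "locomotion"
    HEAD = "head"
    MANIPULATION = "manipulation"
    INTERACTION = "interaction"


# Locomotion primitives listed in the abstract; the snake_case spellings are assumed.
LOCOMOTION_PRIMITIVES = ("walk", "turn", "move_sideways", "change_height")


@dataclass(frozen=True)
class Action:
    kind: ActionKind
    name: str
    argument: str = ""  # hypothetical free-form argument, e.g. "0.5m" or "left"


def parse_action(line: str) -> Action:
    """Parse a hypothetical 'kind:name[:argument]' string into an Action."""
    parts = line.strip().split(":", 2)
    if len(parts) < 2:
        raise ValueError(f"expected 'kind:name[:argument]', got {line!r}")
    kind = ActionKind(parts[0])  # raises ValueError for an unknown family
    name = parts[1]
    if kind is ActionKind.LOCOMOTION and name not in LOCOMOTION_PRIMITIVES:
        raise ValueError(f"unknown locomotion primitive: {name!r}")
    argument = parts[2] if len(parts) == 3 else ""
    return Action(kind, name, argument)
```

Under these assumptions, `parse_action("locomotion:walk:0.5m")` would yield a locomotion action named `walk` with argument `0.5m`, while a string naming an unlisted locomotion primitive would be rejected rather than passed on to a controller.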