🤖 AI Summary
This work addresses the challenge of enabling humanoid robots to execute long-horizon natural language instructions in partially observable environments by proposing a closed-loop framework that translates task plans generated by vision-language models (VLMs) into verifiable subtasks with explicit preconditions and success criteria. Leveraging SAM3 and RGB-D sensing for multi-object 3D geometric perception, the system evaluates predicate conditions within a stable coordinate frame and selects motion primitives subject to reachability and balance constraints. Task progression is driven by state verification, and failures trigger diagnosis-informed replanning. Evaluated on tabletop and mobile manipulation tasks, the approach demonstrates significantly enhanced robustness, attributable to high-precision 3D localization, temporal stability, and an efficient recovery mechanism.
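The summary's core loop — compile an instruction into subtasks with explicit preconditions and success criteria, then gate progression on state verification — can be sketched as below. The schema fields, predicate names, and supervisor logic here are illustrative assumptions, not the paper's actual format.

```python
# Hypothetical subtask record, as a VLM planner might emit it: each
# subtask carries predicate lists the supervisor can check against a
# perceived symbolic state before and after execution.
subtask = {
    "id": 1,
    "skill": "pick",
    "args": {"object": "red_cup"},
    "preconditions": [["reachable", "red_cup"], ["gripper_empty"]],
    "success": [["holding", "red_cup"]],
}

def predicates_hold(predicates, state):
    """Return condition-level diagnostics: a map from each predicate
    to whether it holds in the current symbolic state (a set of
    predicate tuples). Unsatisfied entries can seed replanning."""
    return {tuple(p): tuple(p) in state for p in predicates}

# Example: both preconditions hold, so the subtask may start.
state = {("reachable", "red_cup"), ("gripper_empty",)}
diag = predicates_hold(subtask["preconditions"], state)
assert all(diag.values())  # safe to attempt the pick
```

Returning a per-predicate map rather than a single boolean is what makes the feedback "condition-level": a failed subtask reports exactly which condition was violated.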
📝 Abstract
Robots are increasingly expected to execute open-ended natural language requests in human environments, which demands reliable long-horizon execution under partial observability. This is especially challenging for humanoids because locomotion and manipulation are tightly coupled through stance, reachability, and balance. We present a humanoid agent framework that turns VLM plans into verifiable task programs and closes the loop with multi-object 3D geometric supervision. A VLM planner compiles each instruction into a typed JSON sequence of subtasks with explicit predicate-based preconditions and success conditions. Using SAM3 and RGB-D, we ground all task-relevant entities in 3D, estimate object centroids and extents, and evaluate predicates over stable frames to obtain condition-level diagnostics. The supervisor uses these diagnostics to verify subtask completion and to provide condition-level feedback for progression and replanning. We execute each subtask by coordinating humanoid locomotion and whole-body manipulation, selecting feasible motion primitives under reachability and balance constraints. Experiments on tabletop manipulation and long-horizon humanoid loco-manipulation tasks show improved robustness from multi-object grounding, temporal stability, and recovery-driven replanning.
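As a rough illustration of evaluating a geometric predicate from estimated centroids and extents in a stable base frame, the sketch below tests an `on(a, b)` relation with axis-aligned boxes. The box representation, predicate definition, and tolerances are assumptions for this example, not the paper's actual perception interface.

```python
from dataclasses import dataclass

@dataclass
class Box3D:
    """Axis-aligned object estimate in a stable (e.g. robot-base) frame."""
    centroid: tuple  # (x, y, z) in meters
    extent: tuple    # full size (dx, dy, dz) in meters

def on_top_of(a: Box3D, b: Box3D, z_tol: float = 0.03) -> bool:
    """True if a's bottom face rests near b's top face and their
    horizontal footprints overlap. The z tolerance is illustrative
    and would absorb depth-sensing noise in practice."""
    a_bottom = a.centroid[2] - a.extent[2] / 2
    b_top = b.centroid[2] + b.extent[2] / 2
    if abs(a_bottom - b_top) > z_tol:
        return False
    # horizontal overlap test along x and y
    for i in range(2):
        if abs(a.centroid[i] - b.centroid[i]) > (a.extent[i] + b.extent[i]) / 2:
            return False
    return True

# Hypothetical estimates: a cup resting on a table surface.
cup = Box3D(centroid=(0.50, 0.10, 0.82), extent=(0.08, 0.08, 0.10))
table = Box3D(centroid=(0.55, 0.00, 0.38), extent=(1.20, 0.80, 0.78))
assert on_top_of(cup, table)
```

Evaluating such predicates in a stable frame, rather than the moving camera frame, is what keeps the diagnostics temporally consistent as the humanoid shifts stance.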