Act2Goal: From World Model To General Goal-conditioned Policy

📅 2025-12-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses three key challenges in vision-guided long-horizon robotic manipulation: inadequate task progress modeling, poor generalization across tasks, objects, and environments, and low execution robustness. We propose a novel framework integrating a goal-conditioned visual world model with multi-scale temporal hashing (MSTH). Our method employs an end-to-end coupled world model and policy architecture; MSTH enables fine-grained closed-loop control while preserving global state consistency. To enhance adaptability, we incorporate cross-modal cross-attention, LoRA-based lightweight fine-tuning, and hindsight goal relabeling—enabling zero-shot transfer and reward-free online adaptation. Evaluated on real robots, our approach raises the success rate on out-of-distribution tasks from 30% to 90%, achieves autonomous optimization within minutes, and significantly improves generalization to novel objects, layouts, and environments.
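The cross-modal cross-attention mentioned above couples motor-control tokens with the imagined visual plan. A minimal single-head sketch of that mechanism (shapes and function name are illustrative, not the paper's implementation):

```python
import numpy as np

def cross_attention(queries, keys, values):
    """Single-head cross-attention: policy/motor tokens (queries)
    attend over visual-plan tokens (keys/values).
    Shapes: queries (Tq, d), keys/values (Tk, d)."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)        # (Tq, Tk) scaled similarity
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over plan tokens
    return weights @ values                       # (Tq, d) attended plan features
```

In the paper's architecture this runs end-to-end inside the policy network; the sketch only shows the attention arithmetic itself.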

📝 Abstract
Specifying robotic manipulation tasks in a manner that is both expressive and precise remains a central challenge. While visual goals provide a compact and unambiguous task specification, existing goal-conditioned policies often struggle with long-horizon manipulation due to their reliance on single-step action prediction without explicit modeling of task progress. We propose Act2Goal, a general goal-conditioned manipulation policy that integrates a goal-conditioned visual world model with multi-scale temporal control. Given a current observation and a target visual goal, the world model generates a plausible sequence of intermediate visual states that captures long-horizon structure. To translate this visual plan into robust execution, we introduce Multi-Scale Temporal Hashing (MSTH), which decomposes the imagined trajectory into dense proximal frames for fine-grained closed-loop control and sparse distal frames that anchor global task consistency. The policy couples these representations with motor control through end-to-end cross-attention, enabling coherent long-horizon behavior while remaining reactive to local disturbances. Act2Goal achieves strong zero-shot generalization to novel objects, spatial layouts, and environments. We further enable reward-free online adaptation through hindsight goal relabeling with LoRA-based finetuning, allowing rapid autonomous improvement without external supervision. Real-robot experiments demonstrate that Act2Goal improves success rates from 30% to 90% on challenging out-of-distribution tasks within minutes of autonomous interaction, validating that goal-conditioned world models with multi-scale temporal control provide structured guidance necessary for robust long-horizon manipulation. Project page: https://act2goal.github.io/
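The MSTH decomposition described in the abstract (dense proximal frames for closed-loop control, sparse distal frames anchoring global consistency) can be sketched as a simple frame-selection routine. The function name and split ratios below are illustrative guesses, not the paper's hyperparameters:

```python
def msth_keyframes(trajectory, n_proximal=4, n_distal=3):
    """Split an imagined frame sequence into dense proximal frames
    (near-term, for fine-grained closed-loop control) and sparse
    distal frames (for global task consistency).

    `trajectory` is a list of imagined visual states from the
    goal-conditioned world model.
    """
    # Dense proximal frames: every frame in the immediate horizon.
    proximal = trajectory[:n_proximal]
    # Sparse distal frames: evenly spaced samples over the remainder.
    rest = trajectory[n_proximal:]
    if rest and n_distal > 0:
        step = max(1, len(rest) // n_distal)
        distal = rest[::step][:n_distal]
    else:
        distal = []
    return proximal, distal
```

Both frame sets would then be fed to the policy's cross-attention as conditioning tokens; the proximal set is refreshed frequently as new observations arrive, which is what keeps execution reactive to local disturbances.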
Problem

Research questions and friction points this paper is trying to address.

Developing goal-conditioned policies for long-horizon robotic manipulation tasks
Translating visual plans into robust execution with multi-scale temporal control
Achieving zero-shot generalization and autonomous improvement without external supervision
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates goal-conditioned visual world model with multi-scale temporal control
Uses Multi-Scale Temporal Hashing for dense proximal and sparse distal frame decomposition
Enables reward-free online adaptation through hindsight goal relabeling with LoRA finetuning
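The reward-free adaptation listed above relies on hindsight goal relabeling: states actually reached later in an episode are treated as if they had been the commanded visual goal, yielding supervised goal-conditioned data without any reward signal. A minimal sketch using the common "future" relabeling strategy (the LoRA finetuning step that the paper pairs with this is omitted):

```python
import random

def hindsight_relabel(episode, k=4):
    """Relabel an episode of (observation, action) pairs into
    (observation, hindsight_goal, action) training triples by
    sampling up to `k` future states as substitute goals.

    Sketch only; `k` and the sampling strategy are assumptions,
    not the paper's stated settings.
    """
    relabeled = []
    for t, (obs, act) in enumerate(episode):
        future = list(range(t + 1, len(episode)))
        for g in random.sample(future, min(k, len(future))):
            goal_obs = episode[g][0]  # a state the robot actually achieved
            relabeled.append((obs, goal_obs, act))
    return relabeled
```

Because every relabeled goal was genuinely reached, the resulting triples are valid demonstrations of goal-reaching behavior, which is what allows autonomous improvement within minutes of interaction.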