Act2Goal: From World Model To General Goal-conditioned Policy

📅 2025-12-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses three key challenges in vision-guided long-horizon robotic manipulation: inadequate task progress modeling, poor generalization across tasks, objects, and environments, and low execution robustness. We propose a novel framework integrating a goal-conditioned visual world model with multi-scale temporal hashing (MSTH). Our method employs an end-to-end coupled world model and policy architecture; MSTH enables fine-grained closed-loop control while preserving global state consistency. To enhance adaptability, we incorporate cross-modal cross-attention, LoRA-based lightweight fine-tuning, and hindsight goal relabeling—enabling zero-shot transfer and reward-free online adaptation. Evaluated on real robots, our approach raises the success rate on out-of-distribution tasks from 30% to 90%, achieves autonomous optimization within minutes, and significantly improves generalization to novel objects, layouts, and environments.
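The cross-modal cross-attention mentioned above couples motor-control tokens with the imagined visual plan. A minimal single-head sketch of that mechanism (shapes and function name are illustrative, not the paper's implementation):

```python
import numpy as np

def cross_attention(queries, keys, values):
    """Single-head cross-attention: policy/motor tokens (queries)
    attend over visual-plan tokens (keys/values).
    Shapes: queries (Tq, d), keys/values (Tk, d)."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)        # (Tq, Tk) scaled similarity
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over plan tokens
    return weights @ values                       # (Tq, d) attended plan features
```

In the paper's architecture this runs end-to-end inside the policy network; the sketch only shows the attention arithmetic itself.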

📝 Abstract
Specifying robotic manipulation tasks in a manner that is both expressive and precise remains a central challenge. While visual goals provide a compact and unambiguous task specification, existing goal-conditioned policies often struggle with long-horizon manipulation due to their reliance on single-step action prediction without explicit modeling of task progress. We propose Act2Goal, a general goal-conditioned manipulation policy that integrates a goal-conditioned visual world model with multi-scale temporal control. Given a current observation and a target visual goal, the world model generates a plausible sequence of intermediate visual states that captures long-horizon structure. To translate this visual plan into robust execution, we introduce Multi-Scale Temporal Hashing (MSTH), which decomposes the imagined trajectory into dense proximal frames for fine-grained closed-loop control and sparse distal frames that anchor global task consistency. The policy couples these representations with motor control through end-to-end cross-attention, enabling coherent long-horizon behavior while remaining reactive to local disturbances. Act2Goal achieves strong zero-shot generalization to novel objects, spatial layouts, and environments. We further enable reward-free online adaptation through hindsight goal relabeling with LoRA-based finetuning, allowing rapid autonomous improvement without external supervision. Real-robot experiments demonstrate that Act2Goal improves success rates from 30% to 90% on challenging out-of-distribution tasks within minutes of autonomous interaction, validating that goal-conditioned world models with multi-scale temporal control provide structured guidance necessary for robust long-horizon manipulation. Project page: https://act2goal.github.io/
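The MSTH decomposition described in the abstract (dense proximal frames for closed-loop control, sparse distal frames anchoring global consistency) can be sketched as a simple frame-selection routine. The function name and split ratios below are illustrative guesses, not the paper's hyperparameters:

```python
def msth_keyframes(trajectory, n_proximal=4, n_distal=3):
    """Split an imagined frame sequence into dense proximal frames
    (near-term, for fine-grained closed-loop control) and sparse
    distal frames (for global task consistency).

    `trajectory` is a list of imagined visual states from the
    goal-conditioned world model.
    """
    # Dense proximal frames: every frame in the immediate horizon.
    proximal = trajectory[:n_proximal]
    # Sparse distal frames: evenly spaced samples over the remainder.
    rest = trajectory[n_proximal:]
    if rest and n_distal > 0:
        step = max(1, len(rest) // n_distal)
        distal = rest[::step][:n_distal]
    else:
        distal = []
    return proximal, distal
```

Both frame sets would then be fed to the policy's cross-attention as conditioning tokens; the proximal set is refreshed frequently as new observations arrive, which is what keeps execution reactive to local disturbances.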
Problem

Research questions and friction points this paper is trying to address.

Developing goal-conditioned policies for long-horizon robotic manipulation tasks
Translating visual plans into robust execution with multi-scale temporal control
Achieving zero-shot generalization and autonomous improvement without external supervision
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates goal-conditioned visual world model with multi-scale temporal control
Uses Multi-Scale Temporal Hashing for dense proximal and sparse distal frame decomposition
Enables reward-free online adaptation through hindsight goal relabeling with LoRA finetuning
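The reward-free adaptation listed above relies on hindsight goal relabeling: states actually reached later in an episode are treated as if they had been the commanded visual goal, yielding supervised goal-conditioned data without any reward signal. A minimal sketch using the common "future" relabeling strategy (the LoRA finetuning step that the paper pairs with this is omitted):

```python
import random

def hindsight_relabel(episode, k=4):
    """Relabel an episode of (observation, action) pairs into
    (observation, hindsight_goal, action) training triples by
    sampling up to `k` future states as substitute goals.

    Sketch only; `k` and the sampling strategy are assumptions,
    not the paper's stated settings.
    """
    relabeled = []
    for t, (obs, act) in enumerate(episode):
        future = list(range(t + 1, len(episode)))
        for g in random.sample(future, min(k, len(future))):
            goal_obs = episode[g][0]  # a state the robot actually achieved
            relabeled.append((obs, goal_obs, act))
    return relabeled
```

Because every relabeled goal was genuinely reached, the resulting triples are valid demonstrations of goal-reaching behavior, which is what allows autonomous improvement within minutes of interaction.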