ASH: Agents that Self-Hone via Embodied Learning

📅 2026-05-13

🤖 AI Summary

This work addresses the scalability limits of long-horizon embodied tasks, where current methods rely on hand-engineered rewards or action-labeled demonstrations. The authors propose ASH, a self-improving agent that requires neither. Starting from unlabeled, noisy internet video, ASH learns an initial policy; when it gets stuck, it trains an inverse dynamics model (IDM) on its own trajectories and uses that IDM to extract supervision from relevant internet video. The approach integrates unsupervised keyframe detection, behavioral cloning, and retrieval-augmented mechanisms, complemented by a long-term memory module that stores key moments to support multi-hour planning. Evaluated on *Pokemon Emerald* and *The Legend of Zelda: The Minish Cap*, ASH reaches an average of 11.2/12 and 9.9/12 milestones, respectively, versus 6.5/12 and 6.0/12 for the strongest baseline, and sustains progression across the 8-hour evaluation.

📝 Abstract

Long-horizon embodied tasks remain a fundamental challenge in AI, as current methods rely on hand-engineered rewards or action-labeled demonstrations, neither of which scales. We introduce ASH, an agentic system that learns an embodied policy from unlabeled, noisy internet video, without reward shaping or expert annotation. ASH follows a self-improvement loop; when it gets stuck, ASH learns an Inverse Dynamics Model (IDM) from its own trajectories, and uses its IDM to extract supervision from relevant internet video. ASH uses unsupervised learning to identify key moments from large-scale internet video and retains them as long-term memory -- allowing it to tackle long-horizon problems. We evaluate ASH on two complementary environments demanding multi-hour planning: Pokemon Emerald, a turn-based RPG, and The Legend of Zelda: The Minish Cap, a real-time action-adventure game. In both games, behavioral cloning, retrieval-augmented and zero-shot foundation-model baselines plateau, while ASH sustains progression across our 8-hour evaluation. ASH reaches an average of $11.2/12$ milestones in Pokemon Emerald and $9.9/12$ in Legend of Zelda, while the strongest baseline gets stuck in both environments at an average of $6.5/12$ and $6.0/12$ milestones, respectively. We demonstrate that self-improving agents are a scalable recipe for long-horizon embodied learning.
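The IDM step described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the 2-D grid world, the `ToyIDM` nearest-centroid model, the `MOVES` table, and the `rollout` helper are all hypothetical stand-ins chosen for illustration. The sketch only shows the general idea of fitting an inverse dynamics model on the agent's own action-labeled transitions and then using it to pseudo-label action-free "video" transitions.

```python
import numpy as np

# Hypothetical 2-D grid world: action id -> displacement (up, down, left, right).
MOVES = np.array([[0, 1], [0, -1], [-1, 0], [1, 0]], dtype=float)


class ToyIDM:
    """Toy inverse dynamics model: predicts the action from (state, next_state).

    Assumption for illustration only: a nearest-centroid classifier on the
    state difference, not the model used in the paper.
    """

    def fit(self, states, next_states, actions):
        deltas = next_states - states
        self.actions_ = np.unique(actions)
        # One centroid per action: the mean observed state change.
        self.centroids_ = np.stack(
            [deltas[actions == a].mean(axis=0) for a in self.actions_]
        )
        return self

    def predict(self, states, next_states):
        deltas = next_states - states
        # Distance from every transition to every action centroid.
        dists = np.linalg.norm(
            deltas[:, None, :] - self.centroids_[None, :, :], axis=2
        )
        return self.actions_[dists.argmin(axis=1)]


def rollout(rng, n):
    """Sample n noisy transitions; actions are known only to the caller."""
    states = rng.integers(0, 10, size=(n, 2)).astype(float)
    actions = rng.integers(0, 4, size=n)
    next_states = states + MOVES[actions] + rng.normal(0.0, 0.05, size=(n, 2))
    return states, next_states, actions


rng = np.random.default_rng(0)

# 1) The agent's own trajectories come with action labels.
own_s, own_s2, own_a = rollout(rng, 400)
idm = ToyIDM().fit(own_s, own_s2, own_a)

# 2) "Internet video" provides only (state, next_state); actions are hidden.
vid_s, vid_s2, hidden_a = rollout(rng, 200)
pseudo_a = idm.predict(vid_s, vid_s2)

# 3) The pseudo-labels could then supervise a policy (e.g. via imitation).
accuracy = float((pseudo_a == hidden_a).mean())
print(f"pseudo-label accuracy on held-out video: {accuracy:.3f}")
```

In this sketch the hidden actions are used only to score the pseudo-labels; in the setting the abstract describes, internet video carries no action labels, so that check would not be available.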

Problem

Research questions and friction points this paper is trying to address.

long-horizon embodied tasks

reward-free learning

unlabeled internet video

scalable embodied AI

self-improving agents

Innovation

Methods, ideas, or system contributions that make the work stand out.

self-improving agents

embodied learning

inverse dynamics model

unsupervised video learning

long-horizon planning
