ASH: Agents that Self-Hone via Embodied Learning

📅 2026-05-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the scalability limits of long-horizon embodied tasks, where current methods depend on hand-engineered rewards or action-labeled demonstrations. The authors propose ASH, a self-improving agentic system that learns an embodied policy from unlabeled, noisy internet video, with no reward shaping or expert annotation. When the agent gets stuck, it trains an Inverse Dynamics Model (IDM) on its own trajectories and uses that model to extract supervision from relevant internet video. Unsupervised learning identifies key moments in large-scale video, which are retained as long-term memory to support multi-hour planning. On *Pokemon Emerald* and *The Legend of Zelda: The Minish Cap*, ASH reaches an average of 11.2/12 and 9.9/12 milestones, respectively, while the strongest baseline stalls at 6.5/12 and 6.0/12, and ASH continues to progress across the 8-hour evaluation.
📝 Abstract
Long-horizon embodied tasks remain a fundamental challenge in AI, as current methods rely on hand-engineered rewards or action-labeled demonstrations, neither of which scales. We introduce ASH, an agentic system that learns an embodied policy from unlabeled, noisy internet video, without reward shaping or expert annotation. ASH follows a self-improvement loop: when it gets stuck, ASH learns an Inverse Dynamics Model (IDM) from its own trajectories, and uses its IDM to extract supervision from relevant internet video. ASH uses unsupervised learning to identify key moments from large-scale internet video and retains them as long-term memory, allowing it to tackle long-horizon problems. We evaluate ASH on two complementary environments demanding multi-hour planning: Pokemon Emerald, a turn-based RPG, and The Legend of Zelda: The Minish Cap, a real-time action-adventure game. In both games, behavioral cloning, retrieval-augmented, and zero-shot foundation-model baselines plateau, while ASH sustains progression across our 8-hour evaluation. ASH reaches an average of $11.2/12$ milestones in Pokemon Emerald and $9.9/12$ in The Legend of Zelda: The Minish Cap, while the strongest baseline gets stuck in both environments at an average of $6.5/12$ and $6.0/12$ milestones, respectively. We demonstrate that self-improving agents are a scalable recipe for long-horizon embodied learning.
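The self-improvement loop described in the abstract can be illustrated with a minimal toy sketch. This is not the authors' implementation, and everything in it is an assumption made for illustration: the 1-D corridor environment, the count-based lookup standing in for a learned IDM, retrieval of "relevant" video by simple state membership, duplicate frames as the only form of video noise, and all function names (`fit_idm`, `pseudo_label`, `self_improve`, and so on).

```python
from collections import Counter, defaultdict
from itertools import cycle

ACTIONS = ("left", "right")
GOAL = 5  # hypothetical toy goal: reach cell 5 of a 1-D corridor


def step(state, action):
    """Toy environment: move one cell left or right, clamped to [0, GOAL]."""
    delta = -1 if action == "left" else 1
    return max(0, min(GOAL, state + delta))


def collect_own_trajectories(n_steps):
    """The agent acts itself, so its own transitions come with action labels."""
    state, data = 0, []
    for _, action in zip(range(n_steps), cycle(ACTIONS)):
        nxt = step(state, action)
        data.append((state, action, nxt))
        state = nxt
    return data


def fit_idm(transitions):
    """Stand-in IDM: the most frequent action for each observed state change."""
    counts = defaultdict(Counter)
    for s, a, s2 in transitions:
        counts[s2 - s][a] += 1
    return {d: c.most_common(1)[0][0] for d, c in counts.items()}


def pseudo_label(video, idm):
    """Infer actions for an action-free state sequence; skip frozen frames."""
    labeled = []
    for s, s2 in zip(video, video[1:]):
        delta = s2 - s
        if delta != 0 and delta in idm:
            labeled.append((s, idm[delta]))
    return labeled


def behavior_clone(labeled):
    """Tabular policy: the majority pseudo-labeled action per state."""
    counts = defaultdict(Counter)
    for s, a in labeled:
        counts[s][a] += 1
    return {s: c.most_common(1)[0][0] for s, c in counts.items()}


def rollout(policy, max_steps=20):
    """Run the policy from cell 0; return the state where it stops."""
    state = 0
    for _ in range(max_steps):
        if state == GOAL or state not in policy:
            break  # finished, or stuck with no known action
        state = step(state, policy[state])
    return state


def self_improve(videos, max_rounds=5):
    """When stuck: fit an IDM on own data, label relevant video, update policy."""
    policy = {}
    for _ in range(max_rounds):
        stuck_at = rollout(policy)
        if stuck_at == GOAL:
            break
        idm = fit_idm(collect_own_trajectories(20))
        relevant = [v for v in videos if stuck_at in v]  # toy retrieval step
        labeled = [pair for v in relevant for pair in pseudo_label(v, idm)]
        policy.update(behavior_clone(labeled))
    return policy


# Action-free "videos" (state sequences) with duplicated frames as noise.
VIDEOS = [[0, 0, 1, 2, 3], [3, 3, 4, 5]]
policy = self_improve(VIDEOS)
```

In this toy run the agent first stalls at cell 0, then at cell 3, and each stall triggers a fresh round of labeling on the videos that contain the stalled state. The key property the sketch tries to show is that action labels come only from the agent's own interaction, while the videos contribute states alone.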
Problem

Research questions and friction points this paper is trying to address.

long-horizon embodied tasks
reward-free learning
unlabeled internet video
scalable embodied AI
self-improving agents
Innovation

Methods, ideas, or system contributions that make the work stand out.

self-improving agents
embodied learning
inverse dynamics model
unsupervised video learning
long-horizon planning