LARM: Large Auto-Regressive Model for Long-Horizon Embodied Intelligence

📅 2024-05-27
🏛️ arXiv.org
📈 Citations: 0
Influential: 0

🤖 AI Summary
Long-horizon embodied intelligence faces two bottlenecks: sparse rewards in reinforcement learning (RL) and the heavy compute cost of deploying giant large language models (LLMs). Method: The paper proposes referee RL, a paradigm in which a lightweight auto-regressive action model, LARM (fewer than 5B parameters), directly outputs executable actions instead of generating text, while an external giant LLM acts as a "referee" during training, evaluating long-term behavior and reconstructing dense reward signals. On the theory side, the authors characterize how classic RL rewards vanish in long-horizon, open-world exploration and design the referee-based training framework to counter it. Results: In Minecraft, LARM completes, for the first time, ultra-long-horizon tasks such as harvesting enchanted diamond equipment, which demands significantly longer decision-making chains than the best results of prior methods, and it learns diverse open-world tasks without human intervention. The approach aims to combine the generalization of LLM agents with the deployment efficiency of RL agents.

📝 Abstract
Recent embodied agents are primarily built on reinforcement learning (RL) or large language models (LLMs). RL agents are efficient to deploy but perform only a few tasks. By contrast, giant LLM agents (often more than 1000B parameters) generalize strongly but demand enormous computing resources. In this work, we combine their advantages while avoiding their drawbacks by applying the proposed referee RL to our large auto-regressive model (LARM). Specifically, LARM is built upon a lightweight LLM (fewer than 5B parameters) and directly outputs the next action to execute rather than text. We show mathematically that classic RL feedback vanishes in long-horizon embodied exploration, and we introduce a giant-LLM-based referee to handle this reward vanishment while training LARM. In this way, LARM learns to complete diverse open-world tasks without human intervention. In particular, LARM successfully harvests enchanted diamond equipment in Minecraft, which demands significantly longer decision-making chains than the highest achievements of prior best methods.
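
The reward-vanishing intuition can be illustrated with a small sketch. This is not the paper's derivation or implementation: it assumes a standard discounted-return setting with a discount factor `gamma` (the abstract does not state the formulation), and `referee_reward` is a hypothetical stand-in for the giant LLM referee that simply awards feedback at hand-picked milestone steps.

```python
from typing import List, Set


def discounted_returns(rewards: List[float], gamma: float) -> List[float]:
    """Compute G_t = sum_k gamma^k * r_{t+k} for every step t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each return reuses the one after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def sparse_rewards(horizon: int) -> List[float]:
    """Environment-only feedback: a single reward of 1.0 at the last step."""
    return [0.0] * (horizon - 1) + [1.0]


def referee_reward(step: int, milestones: Set[int]) -> float:
    """Hypothetical referee: 1.0 when the step hits an intermediate milestone.

    In the paper this role is played by a giant LLM; a lookup table is used
    here only to keep the sketch self-contained.
    """
    return 1.0 if step in milestones else 0.0


def refereed_rewards(horizon: int, milestones: Set[int]) -> List[float]:
    """Per-step feedback produced by the hypothetical referee."""
    return [referee_reward(t, milestones) for t in range(horizon)]


if __name__ == "__main__":
    horizon, gamma = 1000, 0.99
    # Assumed milestones every 50 steps, ending at the final step.
    milestones = set(range(49, horizon, 50))

    sparse = discounted_returns(sparse_rewards(horizon), gamma)
    dense = discounted_returns(refereed_rewards(horizon, milestones), gamma)

    print(f"first-step return, sparse reward:   {sparse[0]:.2e}")
    print(f"first-step return, referee reward:  {dense[0]:.2e}")
```

Under these assumed values, the learning signal that reaches the first step from a single terminal reward is orders of magnitude smaller than the signal from referee feedback spread along the trajectory, which is the gap the referee is meant to close.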

Problem

Research questions and friction points this paper is trying to address.

Combining the deployment efficiency of RL agents with the generalization of LLM agents
Vanishing rewards in long-horizon embodied exploration
Completing diverse open-world tasks without human intervention

Innovation

Methods, ideas, or system contributions that make the work stand out.

Large Auto-Regressive Model
Referee Reinforcement Learning
Lightweight Language Model