LoLA: Long Horizon Latent Action Learning for General Robot Manipulation

📅 2025-12-23
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Current Vision-Language-Action (VLA) models struggle to model historical state dynamics and to generate temporally coherent action sequences, limiting their effectiveness in long-horizon, language-guided robotic manipulation. To address this, the paper proposes a state-aware latent re-representation framework that combines: (1) an embodiment-anchored latent space that physically grounds multi-view vision, proprioception, and language instructions, moving beyond naive concatenation of proprioceptive features; and (2) integrated visual-language context encoding, state-driven latent mapping, multi-source sensor fusion, and end-to-end sequential action generation. Experiments on the SIMPLER and LIBERO simulation benchmarks and on real-world Franka and Bi-Manual Aloha platforms show substantial improvements over state-of-the-art methods (e.g., pi0), particularly in long-horizon task success rates, supporting the value of explicit historical state modeling and temporally consistent action sequencing.

📝 Abstract
The capability of performing long-horizon, language-guided robotic manipulation tasks critically relies on leveraging historical information and generating coherent action sequences. However, such capabilities are often overlooked by existing Vision-Language-Action (VLA) models. To address this challenge, we propose LoLA (Long Horizon Latent Action Learning), a framework designed for robot manipulation that integrates long-term multi-view observations and robot proprioception to enable multi-step reasoning and action generation. We first employ Vision-Language Models to encode rich contextual features from historical sequences and multi-view observations. We further introduce a key module, State-Aware Latent Re-representation, which transforms visual inputs and language commands into an actionable robot motion space. Unlike existing VLA approaches that merely concatenate robot proprioception (e.g., joint angles) with VL embeddings, this module leverages such robot states to explicitly ground VL representations in physical scale through a learnable "embodiment-anchored" latent space. We trained LoLA on diverse robotic pre-training datasets and conducted extensive evaluations on simulation benchmarks (SIMPLER and LIBERO), as well as two real-world tasks on Franka and Bi-Manual Aloha robots. Results show that LoLA significantly outperforms prior state-of-the-art methods (e.g., pi0), particularly in long-horizon manipulation tasks.
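The abstract contrasts naive concatenation of proprioception with state-conditioned grounding of VL embeddings. A minimal sketch of that contrast, assuming a FiLM-style scale-and-shift conditioning (the module names, dimensions, and weight shapes here are illustrative assumptions, not the paper's actual implementation):

```python
# Hypothetical sketch (NOT the paper's code): contrast naive concatenation of
# proprioception with a state-conditioned re-representation in which the robot
# state predicts a per-dimension scale and shift applied to VL features.
# DIM and STATE_DIM are assumed toy sizes.
import random

DIM = 8        # VL embedding size (assumed)
STATE_DIM = 4  # proprioceptive state size, e.g. joint angles (assumed)

random.seed(0)
# Toy "learnable" weights mapping state -> (scale, shift) over the VL space.
W_scale = [[random.uniform(-0.1, 0.1) for _ in range(DIM)] for _ in range(STATE_DIM)]
W_shift = [[random.uniform(-0.1, 0.1) for _ in range(DIM)] for _ in range(STATE_DIM)]

def matvec(W, x):
    """Multiply a STATE_DIM x DIM weight matrix by a state vector x."""
    return [sum(W[i][j] * x[i] for i in range(len(x))) for j in range(len(W[0]))]

def concat_baseline(vl_feat, state):
    """Naive baseline: proprioception is simply appended to the VL embedding."""
    return vl_feat + state  # vector widens; no physical grounding of VL dims

def state_anchored(vl_feat, state):
    """State-conditioned re-representation: state modulates each VL dimension."""
    scale = [1.0 + s for s in matvec(W_scale, state)]  # multiplicative grounding
    shift = matvec(W_shift, state)                     # additive grounding
    return [scale[j] * vl_feat[j] + shift[j] for j in range(DIM)]

vl = [random.gauss(0.0, 1.0) for _ in range(DIM)]
st = [0.3, -0.7, 0.1, 0.5]  # e.g. joint angles in radians (assumed)
print(len(concat_baseline(vl, st)))  # widened vector: DIM + STATE_DIM
print(len(state_anchored(vl, st)))   # same DIM, but physically modulated
```

The key design difference this illustrates: concatenation leaves the VL features untouched and defers all grounding to downstream layers, whereas state-conditioned modulation rescales the representation itself as a function of the robot's physical state.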
Problem

Research questions and friction points this paper is trying to address.

Enabling long-horizon, language-guided robot manipulation tasks
Leveraging historical multi-view observations and robot proprioception, which existing VLA models often overlook
Transforming visual and language inputs into actionable robot motions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates long-term multi-view observations and robot proprioception for multi-step reasoning
State-Aware Latent Re-representation module that maps visual inputs and language commands into an actionable robot motion space
Learnable embodiment-anchored latent space that grounds VL representations in physical scale, rather than merely concatenating proprioceptive features