🤖 AI Summary
Current Vision-Language-Action (VLA) models struggle to model historical state dynamics and to generate temporally coherent action sequences, limiting their effectiveness in long-horizon, language-guided robotic manipulation. To address this, the authors propose LoLA, a framework built around state-aware latent re-representation: (1) an embodiment-anchored latent space that physically grounds multi-view vision, proprioception, and language instructions, moving beyond naive concatenation of proprioceptive features; (2) integrated visual-language context encoding, state-driven latent mapping, multi-source sensor fusion, and end-to-end sequential action generation. Experiments on the SIMPLER and LIBERO simulation benchmarks and on real-world Franka and Bi-Manual Aloha platforms show substantial improvements over state-of-the-art methods (e.g., pi0), particularly in long-horizon task success rates. These results empirically support explicit historical state modeling and temporally consistent action sequencing.
📝 Abstract
The capability to perform long-horizon, language-guided robotic manipulation tasks critically relies on leveraging historical information and generating coherent action sequences. However, these capabilities are often overlooked by existing Vision-Language-Action (VLA) models. To address this challenge, we propose LoLA (Long Horizon Latent Action Learning), a framework for robot manipulation that integrates long-term multi-view observations and robot proprioception to enable multi-step reasoning and action generation. We first employ Vision-Language Models to encode rich contextual features from historical sequences and multi-view observations. We further introduce a key module, State-Aware Latent Re-representation, which transforms visual inputs and language commands into an actionable robot motion space. Unlike existing VLA approaches that merely concatenate robot proprioception (e.g., joint angles) with VL embeddings, this module leverages such robot states to explicitly ground VL representations in physical scale through a learnable "embodiment-anchored" latent space. We train LoLA on diverse robotic pre-training datasets and conduct extensive evaluations on simulation benchmarks (SIMPLER and LIBERO), as well as two real-world tasks on Franka and Bi-Manual Aloha robots. Results show that LoLA significantly outperforms prior state-of-the-art methods (e.g., pi0), particularly in long-horizon manipulation tasks.
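The abstract contrasts "embodiment-anchored" grounding with naive concatenation of proprioception onto VL embeddings but gives no implementation details. A minimal NumPy sketch of one plausible reading, where the robot state is mapped into an anchor latent that modulates the VL tokens (FiLM-style scale and shift) rather than being appended to them; all names, dimensions, and the choice of modulation are hypothetical, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: VL token width, proprioceptive state, anchor latent
D_VL, D_STATE, D_ANCHOR = 8, 7, 8

# Stand-ins for learnable parameters (random here, for illustration only)
W_anchor = rng.standard_normal((D_STATE, D_ANCHOR)) * 0.1  # state -> anchor latent
W_gamma = rng.standard_normal((D_ANCHOR, D_VL)) * 0.1      # anchor -> per-dim scale
W_beta = rng.standard_normal((D_ANCHOR, D_VL)) * 0.1       # anchor -> per-dim shift

def state_aware_rerepresentation(vl_tokens, proprio):
    """Ground VL tokens via an embodiment anchor derived from robot state.

    vl_tokens: (T, D_VL) vision-language embeddings
    proprio:   (D_STATE,) robot state, e.g. joint angles

    Returns tokens of the same shape, modulated by the state rather than
    concatenated with it, so every token carries physical-scale information.
    """
    anchor = np.tanh(proprio @ W_anchor)   # embodiment-anchored latent
    gamma = 1.0 + anchor @ W_gamma         # per-dimension scale
    beta = anchor @ W_beta                 # per-dimension shift
    return vl_tokens * gamma + beta        # FiLM-style modulation

vl_tokens = rng.standard_normal((5, D_VL))
proprio = rng.standard_normal(D_STATE)
out = state_aware_rerepresentation(vl_tokens, proprio)
print(out.shape)  # output keeps the (T, D_VL) token shape
```

One design point this sketch makes concrete: because the output shape matches the input, the grounded tokens can flow through the rest of the VLA backbone unchanged, whereas concatenation would alter the token dimensionality downstream.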