🤖 AI Summary
Latent Action Models (LAMs) suffer from two key limitations in vision-language-action (VLA) systems: weak spatial understanding and fragile temporal modeling when input frames are far apart. Both hinder the clarity and robustness of action representations. To address these, we propose Farsighted-LAM, a latent action framework with (1) a geometry-aware image encoder to enhance spatial structural reasoning and (2) a multi-scale temporal modeling mechanism to improve cross-frame dynamic perception. Built on Farsighted-LAM, (3) SSM-VLA is an end-to-end VLA architecture that integrates structured perception with a Visual Chain-of-Thought (Visual CoT) module for explicit reasoning about environmental dynamics. Evaluated on multiple VLA tasks in both simulation and real-world settings, SSM-VLA achieves state-of-the-art performance, improving action prediction stability, decision consistency, and interpretability.
📝 Abstract
Latent Action Models (LAMs) enable Vision-Language-Action (VLA) systems to learn semantic action representations from large-scale unannotated data. Yet, we identify two bottlenecks of LAMs: 1) the commonly adopted end-to-end trained image encoder suffers from poor spatial understanding; 2) LAMs can be fragile when input frames are distant, leading to limited temporal perception. Such factors inevitably hinder stable and clear action modeling. To this end, we propose Farsighted-LAM, a latent action framework with geometry-aware spatial encoding and multi-scale temporal modeling, capturing structural priors and dynamic motion patterns from consecutive frames. We further propose SSM-VLA, an end-to-end VLA framework built upon Farsighted-LAM, which integrates structured perception with a visual Chain-of-Thought module to explicitly reason about environmental dynamics, enhancing decision consistency and interpretability. We validate SSM-VLA on multiple VLA tasks in both simulation and real-world settings, and achieve state-of-the-art performance. Our results demonstrate that our strategy of combining geometry-aware modeling, temporal coherence, and explicit reasoning is effective in enhancing the robustness and generalizability of embodied intelligence.