Seeing Space and Motion: Enhancing Latent Actions with Spatial and Dynamic Awareness for VLA

📅 2025-09-30

🤖 AI Summary
Latent Action Models (LAMs) have two key limitations in vision-language-action (VLA) systems: weak spatial understanding, and fragile temporal modeling when input frames are far apart. Both hinder stable and clear action representations. To address these, the authors propose Farsighted-LAM, a latent action framework that combines (1) geometry-aware spatial encoding to capture structural priors and (2) multi-scale temporal modeling to capture dynamic motion patterns across frames. Built on Farsighted-LAM, SSM-VLA is an end-to-end VLA framework that integrates structured perception with a visual Chain-of-Thought (Visual CoT) module for explicit reasoning about environmental dynamics. Evaluated on multiple VLA tasks in both simulation and real-world settings, SSM-VLA achieves state-of-the-art performance, with improved decision consistency and interpretability.

📝 Abstract
Latent Action Models (LAMs) enable Vision-Language-Action (VLA) systems to learn semantic action representations from large-scale unannotated data. Yet, we identify two bottlenecks of LAMs: 1) the commonly adopted end-to-end trained image encoder suffers from poor spatial understanding; 2) LAMs can be fragile when input frames are distant, leading to limited temporal perception. Such factors inevitably hinder stable and clear action modeling. To this end, we propose Farsighted-LAM, a latent action framework with geometry-aware spatial encoding and multi-scale temporal modeling, capturing structural priors and dynamic motion patterns from consecutive frames. We further propose SSM-VLA, an end-to-end VLA framework built upon Farsighted-LAM, which integrates structured perception with a visual Chain-of-Thought module to explicitly reason about environmental dynamics, enhancing decision consistency and interpretability. We validate SSM-VLA on multiple VLA tasks in both simulation and real-world settings, and achieve state-of-the-art performance. Our results demonstrate that our strategy of combining geometry-aware modeling, temporal coherence, and explicit reasoning is effective in enhancing the robustness and generalizability of embodied intelligence.
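
To make the phrase "when input frames are distant" concrete, here is a generic latent action formulation of the kind commonly used in the LAM literature. It is only a sketch: the abstract does not state the paper's objective, and the symbols E_\phi (inverse-dynamics encoder), D_\theta (forward decoder), o_t (frame at step t), z_t (latent action), and k (frame gap) are assumptions introduced here for illustration.

```latex
% Generic LAM sketch (assumed notation, not taken from the paper).
% The encoder infers a latent action from a frame pair k steps apart,
% and the decoder must reconstruct the later frame from it.
\begin{align}
  z_t &= E_\phi\left(o_t,\, o_{t+k}\right), \\
  \hat{o}_{t+k} &= D_\theta\left(o_t,\, z_t\right), \\
  \mathcal{L}(\phi, \theta) &= \left\lVert \hat{o}_{t+k} - o_{t+k} \right\rVert_2^2 .
\end{align}
```

Under this reading, a larger gap k means z_t has to summarize more motion between the two frames, which is one plausible way to interpret the fragility the abstract describes.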
Problem

Research questions and friction points this paper is trying to address.

Improving spatial understanding in latent action models
Enhancing temporal perception for distant input frames
Increasing decision consistency and interpretability in VLA systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Geometry-aware spatial encoding for structural priors
Multi-scale temporal modeling for dynamic motion
Visual Chain-of-Thought for explicit reasoning about environmental dynamics
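
As a rough illustration of what "multi-scale" could mean for temporal modeling, the sketch below enumerates frame-index pairs at several temporal gaps from one trajectory. This is not the paper's mechanism, which this page does not describe. The function name `multi_scale_pairs` and the default strides are hypothetical choices made for the example.

```python
from typing import Iterator, Sequence, Tuple


def multi_scale_pairs(
    num_frames: int,
    strides: Sequence[int] = (1, 4, 16),  # hypothetical temporal scales
) -> Iterator[Tuple[int, int, int]]:
    """Yield (t, t + k, k) index triples for each stride k.

    Each triple names a (current, future) frame pair that a latent action
    model could be trained on; strides longer than the trajectory yield nothing.
    """
    for k in strides:
        if k <= 0:
            raise ValueError("strides must be positive")
        # range() is empty when num_frames <= k, so short clips are skipped.
        for t in range(num_frames - k):
            yield t, t + k, k


# Usage: a 6-frame clip sampled at a short and a long gap.
pairs = list(multi_scale_pairs(num_frames=6, strides=(1, 4)))
```

Mixing short and long gaps in one training set exposes a model to both fine-grained and coarse motion, which is the general motivation for multi-scale sampling.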