Seeing Space and Motion: Enhancing Latent Actions with Spatial and Dynamic Awareness for VLA

📅 2025-09-30
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Latent Action Models (LAMs) in vision-language-action (VLA) systems suffer from two key limitations: weak spatial understanding and unstable long-horizon temporal modeling, both of which hinder clear and robust action representations. To address these, we propose Farsighted-LAM, which combines (1) a geometry-aware image encoder to enhance spatial structural reasoning with (2) a multi-scale temporal modeling mechanism to improve cross-frame dynamic perception, and (3) SSM-VLA, an end-to-end architecture built on Farsighted-LAM that integrates structured perception with a Visual Chain-of-Thought (Visual CoT) module for explicit reasoning about environmental dynamics. Evaluated across multiple VLA benchmarks in both simulation and real-world settings, Farsighted-LAM achieves state-of-the-art performance, significantly improving action-prediction stability, semantic consistency, and decision interpretability.

📝 Abstract
Latent Action Models (LAMs) enable Vision-Language-Action (VLA) systems to learn semantic action representations from large-scale unannotated data. Yet, we identify two bottlenecks of LAMs: 1) the commonly adopted end-to-end trained image encoder suffers from poor spatial understanding; 2) LAMs can be fragile when input frames are distant, leading to limited temporal perception. Such factors inevitably hinder stable and clear action modeling. To this end, we propose Farsighted-LAM, a latent action framework with geometry-aware spatial encoding and multi-scale temporal modeling, capturing structural priors and dynamic motion patterns from consecutive frames. We further propose SSM-VLA, an end-to-end VLA framework built upon Farsighted-LAM, which integrates structured perception with a visual Chain-of-Thought module to explicitly reason about environmental dynamics, enhancing decision consistency and interpretability. We validate SSM-VLA on multiple VLA tasks in both simulation and real-world settings, and achieve state-of-the-art performance. Our results demonstrate that our strategy of combining geometry-aware modeling, temporal coherence, and explicit reasoning is effective in enhancing the robustness and generalizability of embodied intelligence.
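The multi-scale temporal modeling idea from the abstract can be sketched in a few lines: compute frame-difference features at several temporal strides, so that both fast (small-stride) and slow (large-stride) motion patterns contribute to the action representation. This is a hypothetical illustration only; the paper's actual mechanism operates on learned latent features, not raw frame differences, and the function name and stride choices here are assumptions.

```python
import numpy as np

def multi_scale_temporal_features(frames, strides=(1, 2, 4)):
    """Illustrative sketch: for each temporal stride s, take differences
    between frames s steps apart (approximating motion at that time
    scale), then average over time to get one descriptor per scale.

    frames: array of shape (T, D) with T video frames flattened to D dims.
    Returns an array of shape (len(strides), D).
    """
    features = []
    for s in strides:
        # Difference between frames s steps apart captures motion
        # at temporal scale s.
        diffs = frames[s:] - frames[:-s]
        # Temporal average pooling yields one descriptor per scale.
        features.append(diffs.mean(axis=0))
    return np.stack(features)
```

For linearly evolving frames, the stride-s descriptor scales with s, showing how slower strides emphasize longer-horizon motion; a learned model would replace the raw differences and mean-pooling with trainable encoders.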
Problem

Research questions and friction points this paper is trying to address.

Improving spatial understanding in latent action models
Enhancing temporal perception for distant input frames
Increasing decision consistency and interpretability in VLA systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Geometry-aware spatial encoding for structural priors
Multi-scale temporal modeling for dynamic motion
Visual Chain-of-Thought for explicit reasoning about environmental dynamics
Zhejia Cai
Tsinghua Shenzhen International Graduate School, Tsinghua University
Yandan Yang
BIGAI (Beijing Institute for General Artificial Intelligence)
Computer Vision, Generation, Embodied AI
Xinyuan Chang
Xi'an Jiaotong University; Alibaba-Amap
Autonomous Driving, Computer Vision
Shiyi Liang
School of Software Engineering, Xi’an Jiaotong University
Ronghan Chen
Amap, Alibaba Group
Feng Xiong
Amap, Alibaba Group
Mu Xu
Amap, Alibaba Group
Ruqi Huang
Tsinghua Shenzhen International Graduate School
3D Computer Vision, Shape Analysis, Geometry Processing