SA-VLA: Spatially-Aware Flow-Matching for Vision-Language-Action Reinforcement Learning

📅 2026-01-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the degraded robustness of Vision-Language-Action (VLA) models under spatial distribution shifts after reinforcement learning (RL) fine-tuning, a degradation linked to the erosion of spatial inductive bias during RL adaptation. To mitigate this, the authors propose SA-VLA, a spatially aware RL adaptation framework with three key components: a representation learning scheme that fuses implicit spatial representations with visual tokens, a dense reward mechanism grounded in task-specific geometric progress, and SCAN, a spatially conditioned annealed exploration strategy tailored to flow-matching dynamics. Evaluated on multi-object manipulation and cluttered-scene benchmarks, the approach enables stable fine-tuning and substantially improves zero-shot spatial generalization and policy robustness.
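The geometric-progress reward is only described at a high level here; as a minimal, non-authoritative sketch, such shaping could reward reductions in a weighted gripper-to-object and object-to-goal distance (the function name, pose inputs, and weights below are illustrative assumptions, not the paper's actual reward):

```python
import numpy as np

def geometric_progress_reward(ee_pos, obj_pos, goal_pos, prev_potential,
                              w_reach=0.5, w_place=0.5):
    """Hypothetical dense reward from task geometry: the agent is rewarded
    for shrinking a weighted gripper-to-object and object-to-goal distance."""
    reach_dist = np.linalg.norm(ee_pos - obj_pos)    # gripper -> target object
    place_dist = np.linalg.norm(obj_pos - goal_pos)  # object -> goal location
    potential = w_reach * reach_dist + w_place * place_dist
    reward = prev_potential - potential              # positive when geometry improves
    return reward, potential
```

A progress signal of this kind stays dense throughout the episode, in contrast to the sparse success rewards that the abstract identifies as a driver of spatial-bias erosion.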

📝 Abstract
Vision-Language-Action (VLA) models exhibit strong generalization in robotic manipulation, yet reinforcement learning (RL) fine-tuning often degrades robustness under spatial distribution shifts. For flow-matching VLA policies, this degradation is closely associated with the erosion of spatial inductive bias during RL adaptation, as sparse rewards and spatially agnostic exploration increasingly favor short-horizon visual cues. To address this issue, we propose SA-VLA, a spatially-aware RL adaptation framework that preserves spatial grounding during policy optimization by aligning representation learning, reward design, and exploration with task geometry. SA-VLA fuses implicit spatial representations with visual tokens, provides dense rewards that reflect geometric progress, and employs SCAN, a spatially-conditioned annealed exploration strategy tailored to flow-matching dynamics. Across challenging multi-object and cluttered manipulation benchmarks, SA-VLA enables stable RL fine-tuning and improves zero-shot spatial generalization, yielding more robust and transferable behaviors. Code and project page are available at https://xupan.top/Projects/savla.
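SCAN itself is only named in the abstract; as a rough sketch of what spatially-conditioned annealed exploration might look like for a flow-matching policy, one could inject noise into the Euler integration of the learned velocity field, with a scale that decays over training and is gated by a spatial signal. Everything below (velocity_net, the gating rule, the schedule, and all hyperparameters) is an assumption for illustration, not the paper's formulation:

```python
import torch

@torch.no_grad()
def sample_action(velocity_net, obs_emb, spatial_emb, action_dim,
                  n_steps=10, sigma0=0.3, train_frac=0.0):
    """Sketch of spatially gated, annealed exploration for a flow-matching policy.

    Integrates a learned velocity field from Gaussian noise to an action
    (Euler steps), adding exploration noise whose scale decays linearly with
    training progress (train_frac in [0, 1]) and is modulated by a gate
    derived from the spatial embedding."""
    x = torch.randn(1, action_dim)                      # flow starts from noise at t = 0
    dt = 1.0 / n_steps
    sigma = sigma0 * (1.0 - train_frac)                 # annealing schedule
    gate = torch.sigmoid(-spatial_emb.norm(dim=-1, keepdim=True))  # weak spatial signal -> more exploration
    for i in range(n_steps):
        t = torch.full((1, 1), i * dt)
        v = velocity_net(x, t, obs_emb, spatial_emb)    # predicted velocity at (x, t)
        x = x + dt * v                                  # deterministic Euler step
        x = x + sigma * gate * (dt ** 0.5) * torch.randn_like(x)  # annealed, spatially gated noise
    return x
```

Annealing the noise toward zero recovers the deterministic flow-matching sampler at the end of training, which is one plausible way to reconcile exploration with the spatial grounding the method aims to preserve.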
Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action
spatial distribution shift
reinforcement learning fine-tuning
spatial inductive bias
flow-matching
Innovation

Methods, ideas, or system contributions that make the work stand out.

spatially-aware reinforcement learning
vision-language-action models
flow-matching
geometric reward design
spatially-conditioned exploration
Xu Pan
Harvard University
computational neuroscience, deep learning

Zhenglin Wan
Department of Computer Science, National University of Singapore, Singapore

Xingrui Yu
Scientist, CFAR, A*STAR
Machine Learning, Robust Imitation Learning, Trustworthy AI

Xianwei Zheng
State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing (LIESMARS), Wuhan University, Wuhan, P.R. China

Youkai Ke
State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing (LIESMARS), Wuhan University, Wuhan, P.R. China

Ming Sun
Institute of Information Engineering, Chinese Academy of Sciences, Beijing, P.R. China

Rui Wang
Institute of Automation, Chinese Academy of Sciences
Biomimetic robot, underwater robot, intelligent control, mechatronics

Ziwei Wang
School of Electrical and Electronic Engineering, Nanyang Technological University
embodied AI, robotics, computer vision

Ivor Tsang
Centre for Frontier AI Research (CFAR), Institute of High Performance Computing (IHPC), Agency for Science, Technology and Research (A*STAR), Singapore