SA-VLA: Spatially-Aware Flow-Matching for Vision-Language-Action Reinforcement Learning

📅 2026-01-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the degraded robustness of Vision-Language-Action (VLA) models under spatial distribution shifts after reinforcement learning (RL) fine-tuning, a degradation linked to the erosion of spatial inductive bias during RL adaptation. To mitigate this, the authors propose SA-VLA, a spatially aware RL adaptation framework with three key components: a representation learning scheme that fuses implicit spatial representations with visual tokens, a dense reward mechanism grounded in task-specific geometric progress, and SCAN, a spatially conditioned annealed exploration strategy tailored to flow-matching dynamics. Evaluated on multi-object manipulation and cluttered-scene benchmarks, the approach enables stable fine-tuning and substantially improves zero-shot spatial generalization and policy robustness.
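The geometric-progress reward is only described at a high level here; as a minimal, non-authoritative sketch, such shaping could reward reductions in a weighted gripper-to-object and object-to-goal distance (the function name, pose inputs, and weights below are illustrative assumptions, not the paper's actual reward):

```python
import numpy as np

def geometric_progress_reward(ee_pos, obj_pos, goal_pos, prev_potential,
                              w_reach=0.5, w_place=0.5):
    """Hypothetical dense reward from task geometry: the agent is rewarded
    for shrinking a weighted gripper-to-object and object-to-goal distance."""
    reach_dist = np.linalg.norm(ee_pos - obj_pos)    # gripper -> target object
    place_dist = np.linalg.norm(obj_pos - goal_pos)  # object -> goal location
    potential = w_reach * reach_dist + w_place * place_dist
    reward = prev_potential - potential              # positive when geometry improves
    return reward, potential
```

A progress signal of this kind stays dense throughout the episode, in contrast to the sparse success rewards that the abstract identifies as a driver of spatial-bias erosion.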

📝 Abstract
Vision-Language-Action (VLA) models exhibit strong generalization in robotic manipulation, yet reinforcement learning (RL) fine-tuning often degrades robustness under spatial distribution shifts. For flow-matching VLA policies, this degradation is closely associated with the erosion of spatial inductive bias during RL adaptation, as sparse rewards and spatially agnostic exploration increasingly favor short-horizon visual cues. To address this issue, we propose SA-VLA, a spatially-aware RL adaptation framework that preserves spatial grounding during policy optimization by aligning representation learning, reward design, and exploration with task geometry. SA-VLA fuses implicit spatial representations with visual tokens, provides dense rewards that reflect geometric progress, and employs SCAN, a spatially-conditioned annealed exploration strategy tailored to flow-matching dynamics. Across challenging multi-object and cluttered manipulation benchmarks, SA-VLA enables stable RL fine-tuning and improves zero-shot spatial generalization, yielding more robust and transferable behaviors. Code and project page are available at https://xupan.top/Projects/savla.
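SCAN itself is only named in the abstract; as a rough sketch of what spatially-conditioned annealed exploration might look like for a flow-matching policy, one could inject noise into the Euler integration of the learned velocity field, with a scale that decays over training and is gated by a spatial signal. Everything below (velocity_net, the gating rule, the schedule, and all hyperparameters) is an assumption for illustration, not the paper's formulation:

```python
import torch

@torch.no_grad()
def sample_action(velocity_net, obs_emb, spatial_emb, action_dim,
                  n_steps=10, sigma0=0.3, train_frac=0.0):
    """Sketch of spatially gated, annealed exploration for a flow-matching policy.

    Integrates a learned velocity field from Gaussian noise to an action
    (Euler steps), adding exploration noise whose scale decays linearly with
    training progress (train_frac in [0, 1]) and is modulated by a gate
    derived from the spatial embedding."""
    x = torch.randn(1, action_dim)                      # flow starts from noise at t = 0
    dt = 1.0 / n_steps
    sigma = sigma0 * (1.0 - train_frac)                 # annealing schedule
    gate = torch.sigmoid(-spatial_emb.norm(dim=-1, keepdim=True))  # weak spatial signal -> more exploration
    for i in range(n_steps):
        t = torch.full((1, 1), i * dt)
        v = velocity_net(x, t, obs_emb, spatial_emb)    # predicted velocity at (x, t)
        x = x + dt * v                                  # deterministic Euler step
        x = x + sigma * gate * (dt ** 0.5) * torch.randn_like(x)  # annealed, spatially gated noise
    return x
```

Annealing the noise toward zero recovers the deterministic flow-matching sampler at the end of training, which is one plausible way to reconcile exploration with the spatial grounding the method aims to preserve.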
Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action
spatial distribution shift
reinforcement learning fine-tuning
spatial inductive bias
flow-matching
Innovation

Methods, ideas, or system contributions that make the work stand out.

spatially-aware reinforcement learning
vision-language-action models
flow-matching
geometric reward design
spatially-conditioned exploration
Xu Pan
Harvard University
computational neuroscience, deep learning

Zhenglin Wan
Department of Computer Science, National University of Singapore, Singapore

Xingrui Yu
Scientist, CFAR, A*STAR
Machine Learning, Robust Imitation Learning, Trustworthy AI

Xianwei Zheng
State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing (LIESMARS), Wuhan University, Wuhan, P.R. China

Youkai Ke
State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing (LIESMARS), Wuhan University, Wuhan, P.R. China

Ming Sun
Institute of Information Engineering, Chinese Academy of Sciences, Beijing, P.R. China

Rui Wang
Institute of Automation, Chinese Academy of Sciences
Biomimetic robot, underwater robot, intelligent control, mechatronics

Ziwei Wang
School of Electrical and Electronic Engineering, Nanyang Technological University
embodied AI, robotics, computer vision

Ivor Tsang
Centre for Frontier AI Research (CFAR), Institute of High Performance Computing (IHPC), Agency for Science, Technology and Research (A*STAR), Singapore