Boosting Vision-Language-Action Finetuning with Feasible Action Neighborhood Prior

📅 2026-04-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing vision-language-action (VLA) fine-tuning methods assume a one-to-one mapping between states and actions, disregarding the physical reality of robotic manipulation: a feasible action neighborhood (FAN), a continuous set of valid actions, often exists for a given state. This oversight leads to poor generalization and low sample efficiency. To address it, this work introduces the geometric structure of the FAN as an explicit physical prior for VLA fine-tuning, proposing a Gaussian prior regularization mechanism that aligns the model's output distribution with the FAN. This encourages locally smooth, unimodal action predictions centered on preferred directions and magnitudes. The approach integrates seamlessly with both supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT), significantly improving sample efficiency and task success rates on both in-distribution and out-of-distribution tasks, demonstrating the effectiveness and generality of FAN-guided regularization.
📝 Abstract
In real-world robotic manipulation, states typically admit a neighborhood of near-equivalent actions. That is, for each state there exists a feasible action neighborhood (FAN) rather than a single correct action, within which motions yield indistinguishable progress. However, prevalent VLA training methodologies are directly inherited from linguistic settings and do not exploit the FAN property, leading to poor generalization and low sample efficiency. To address this limitation, we introduce a FAN-guided regularizer that shapes the model's output distribution to align with the geometry of the FAN. Concretely, we introduce a Gaussian prior that promotes locally smooth and unimodal predictions around the preferred direction and magnitude. In extensive experiments across both reinforcement fine-tuning (RFT) and supervised fine-tuning (SFT), our method achieves significant improvements in sample efficiency and success rate in both in-distribution and out-of-distribution (OOD) scenarios. By aligning with the intrinsic action tolerance of physical manipulation, FAN-guided regularization provides a principled and practical method for sample-efficient and generalizable VLA adaptation.
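The listing gives no implementation details. Purely as an illustrative sketch (not the authors' code), one way to realize a Gaussian prior over a feasible action neighborhood is a closed-form KL term that penalizes the model's predicted diagonal-Gaussian action distribution for drifting from a Gaussian centered on the demonstrated action, with standard deviation set by an assumed FAN radius. The names `fan_gaussian_kl` and `fan_radius` are hypothetical, chosen for this sketch.

```python
import numpy as np

def fan_gaussian_kl(mu_pred, sigma_pred, a_demo, fan_radius):
    """Hypothetical FAN-prior regularizer (illustrative sketch only).

    Computes KL( N(mu_pred, diag(sigma_pred^2)) || N(a_demo, fan_radius^2 I) )
    summed over action dimensions, using the closed form for diagonal Gaussians.
    The regularizer is zero exactly when the predicted distribution matches the
    FAN prior centered on the demonstrated action.
    """
    var_pred = sigma_pred ** 2
    var_fan = fan_radius ** 2
    kl_per_dim = (
        np.log(fan_radius / sigma_pred)                      # scale mismatch
        + (var_pred + (mu_pred - a_demo) ** 2) / (2 * var_fan)  # spread + offset
        - 0.5
    )
    return kl_per_dim.sum()
```

Added to the usual imitation loss with a small weight, such a term would encourage the unimodal, locally smooth predictions the abstract describes; the actual regularizer in the paper may differ.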
Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action
Feasible Action Neighborhood
Sample Efficiency
Generalization
Robotic Manipulation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Feasible Action Neighborhood
Vision-Language-Action
Regularization
Sample Efficiency
Generalization
Haochen Niu
Shanghai Jiao Tong University, China
Kanyu Zhang
Shanghai Jiao Tong University, China
Shuyu Yin
Shanghai Jiao Tong University, China
Qinghai Guo
Huawei Technologies, China
Peilin Liu
Shanghai Jiao Tong University, China
Fei Wen
Department of Electronic Engineering, Shanghai Jiao Tong University
machine learning, robotic navigation