What to Ignore, What to React: Visually Robust RL Fine-Tuning of VLA Models

📅 2026-05-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the vulnerability of vision-language-action (VLA) models to visual distribution shifts at deployment, and in particular their difficulty in telling task-irrelevant visual changes apart from task-relevant ones. The authors propose PAIR-VLA, a framework that uses paired visual variants as policy-level guidance signals. Within a PPO-based reinforcement learning fine-tuning scheme, PAIR-VLA models the two types of visual variation explicitly: an invariance loss is applied to task-preserving variants, and a sensitivity loss is applied to task-altering variants. Evaluated on the ManiSkill3 benchmark, PAIR-VLA improves on standard PPO by an average of 16.62% for π₀.₅ and 9.10% for OpenVLA, and it generalizes and transfers across diverse visual distribution shifts.

📝 Abstract

Reinforcement learning (RL) fine-tuning has shown promise for Vision-Language-Action (VLA) models in robotic manipulation, but deployment-time visual shifts pose practical challenges. A key difficulty is that standard task rewards supervise task success, but offer limited guidance on whether a visual change is task-irrelevant or changes the behavior required for manipulation. We propose PAIR-VLA (Paired Action Invariance & Sensitivity for Visually Robust VLA), an RL fine-tuning framework to address this difficulty by adding two auxiliary objectives over paired visual variants during PPO optimization: an invariance term that reduces the discrepancy between action distributions for a task-preserving pair (e.g., different distractors), and a sensitivity objective that encourages separable action distributions for a task-altering pair (e.g., target object in a different pose). Together, these objectives turn visual variants from mere observation diversity into behavior-level guidance on policy responses during RL fine-tuning. We evaluate on ManiSkill3 across two representative VLA architectures, OpenVLA and $π_{0.5}$, under diverse out-of-distribution visual shifts including unseen distractors, texture changes, target object pose variation, viewpoint shifts, and lighting changes. Our method consistently improves over standard PPO, achieving average improvements of 16.62% on $π_{0.5}$ and 9.10% on OpenVLA. Notably, ablations further show generalization across visual shifts: invariance guidance learned from distractor and texture variants transfers to target-pose and lighting shifts, while adding sensitivity guidance on target-pose variants further improves robustness to nuisance shifts, highlighting the broader transferability of behavior-level RL guidance.
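The two auxiliary objectives described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the divergence, the form of the sensitivity term, or the loss weights, so everything below is an assumption made for illustration: action distributions are treated as categorical, the invariance term is a KL divergence, the sensitivity term is a hinge on that same KL, and the names `invariance_loss`, `sensitivity_loss`, `total_loss`, `margin`, `lambda_inv`, and `lambda_sens` are hypothetical and not taken from the paper.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def kl_divergence(logits_p, logits_q):
    """KL(p || q) per batch element for categorical action distributions."""
    log_p = log_softmax(logits_p)
    log_q = log_softmax(logits_q)
    return (np.exp(log_p) * (log_p - log_q)).sum(axis=-1)


def invariance_loss(logits_orig, logits_preserving):
    """Assumed form: mean KL between the policy's action distributions on an
    observation and on its task-preserving variant (e.g. other distractors).
    Minimizing it pulls the two distributions together."""
    return kl_divergence(logits_orig, logits_preserving).mean()


def sensitivity_loss(logits_orig, logits_altering, margin=1.0):
    """Assumed form: hinge on the KL between the action distributions on an
    observation and on its task-altering variant (e.g. a different target pose).
    The penalty is nonzero only while the divergence is below `margin`."""
    divergence = kl_divergence(logits_orig, logits_altering)
    return np.maximum(0.0, margin - divergence).mean()


def total_loss(ppo_loss, logits_orig, logits_preserving, logits_altering,
               lambda_inv=0.1, lambda_sens=0.1, margin=1.0):
    """Hypothetical combination: PPO loss plus the two weighted auxiliary terms."""
    return (ppo_loss
            + lambda_inv * invariance_loss(logits_orig, logits_preserving)
            + lambda_sens * sensitivity_loss(logits_orig, logits_altering, margin))
```

In this sketch, a policy that responds identically to a task-preserving pair incurs no invariance penalty, while a policy that responds identically to a task-altering pair incurs the full `margin` as sensitivity penalty. In an actual fine-tuning loop the logits would come from the VLA policy evaluated on each image of the pair, and the loss would be computed in an autodiff framework rather than NumPy.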

Problem

Research questions and friction points this paper is trying to address.

Visually Robust RL

Vision-Language-Action Models

Visual Shifts

Task-Irrelevant Changes

Behavior-Level Guidance

Innovation

Methods, ideas, or system contributions that make the work stand out.

Visually Robust RL

VLA Models

Action Invariance

Action Sensitivity

Paired Visual Variants

🔎 Similar Papers

ViGoR: Improving Visual Grounding of Large Vision Language Models with Fine-Grained Reward Modeling

2024-02-09 · European Conference on Computer Vision · Citations: 29
