Mechanistic Finetuning of Vision-Language-Action Models via Few-Shot Demonstrations

📅 2025-11-27

📈 Citations: 0

✨ Influential: 0

career value

253K/year

🤖 AI Summary

Existing vision-language-action (VLA) models suffer from limited generalizability and interpretability in robotic deployment due to entity discrepancies, environmental variability, and diverse spatial relationships—necessitating task-specific fine-tuning yet lacking functional disentanglement across visual, linguistic, and physical modalities. To address this, we propose a mechanism-driven few-shot fine-tuning method grounded in mechanistic interpretability: we identify task-relevant sparse neural representations via causal mediation analysis and selectively update only task-critical attention heads. Integrated with Robotic Steering—a structured policy refinement strategy—the approach is validated on a Franka Emika robotic arm. Compared to LoRA, our method improves task adaptation robustness by +12.3% in success rate, reduces trainable parameter updates by 37%, and yields human-interpretable, traceable action decisions rooted in identifiable neural mechanisms.

Technology Category

Application Category

📝 Abstract

Vision-Language Action (VLAs) models promise to extend the remarkable success of vision-language models (VLMs) to robotics. Yet, unlike VLMs in the vision-language domain, VLAs for robotics require finetuning to contend with varying physical factors like robot embodiment, environment characteristics, and spatial relationships of each task. Existing fine-tuning methods lack specificity, adapting the same set of parameters regardless of a task's visual, linguistic, and physical characteristics. Inspired by functional specificity in neuroscience, we hypothesize that it is more effective to finetune sparse model representations specific to a given task. In this work, we introduce Robotic Steering, a finetuning approach grounded in mechanistic interpretability that leverages few-shot demonstrations to identify and selectively finetune task-specific attention heads aligned with the physical, visual, and linguistic requirements of robotic tasks. Through comprehensive on-robot evaluations with a Franka Emika robot arm, we demonstrate that Robotic Steering outperforms LoRA while achieving superior robustness under task variation, reduced computational cost, and enhanced interpretability for adapting VLAs to diverse robotic tasks.

Problem

Research questions and friction points this paper is trying to address.

Fine-tune VLAs for robotic tasks with few-shot demonstrations

Selectively adapt task-specific attention heads for physical factors

Enhance robustness and interpretability in robotic vision-language-action models

Innovation

Methods, ideas, or system contributions that make the work stand out.

Few-shot demonstrations identify task-specific attention heads

Selectively finetune sparse model representations for robotics

Mechanistic interpretability enhances VLA adaptation efficiency

🔎 Similar Papers

No similar papers found.

Toyota Research Institute

Los Altos, CA / Cambridge, MA

Promotion (PhD): KI-basierte Lernstrategien für Smart Manufacturing im europäischen HORIZON-Projekt

Bosch Group

ARENA2036 in Stuttgart

AI Research Scientist, Robotics