Not All Features Are Created Equal: A Mechanistic Study of Vision-Language-Action Models

📅 2026-03-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates how vision–language–action (VLA) models translate multimodal inputs into concrete motor actions, probing their internal mechanisms with interpretability techniques (activation injection, sparse autoencoders, linear probing, and causal ablation) applied systematically to six VLA models spanning 80M–7B parameters and 394,000+ rollout episodes. The analysis reveals that action generation is predominantly driven by the visual pathway, with language playing a critical role only when the visual scene does not uniquely specify the task. Motor programs are grounded in scene coordinates rather than abstract task representations, and distinct activation subspaces separately encode motor programs (via expert pathways) and goal semantics (via vision–language model pathways). These findings are validated on four benchmarks, and the authors release Action Atlas, an interactive tool for exploring VLA representations.
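
As a rough illustration of the activation-injection procedure described above, the sketch below uses PyTorch forward hooks to cache a layer's output during a source-task forward pass and overwrite that layer's output during a target-task pass. The `policy`, `layer`, and observation objects are hypothetical placeholders, not the paper's actual interfaces.

```python
# Minimal sketch of activation injection with PyTorch forward hooks.
# `policy`, `layer`, and `obs` are hypothetical; the paper's models
# (e.g. X-VLA, SmolVLA) each expose layers and inputs differently.
import torch

def cache_activations(policy, layer, obs):
    """Run a source-episode step and cache the chosen layer's output."""
    cached = {}

    def save_hook(module, inputs, output):
        cached["act"] = output.detach().clone()

    handle = layer.register_forward_hook(save_hook)
    with torch.no_grad():
        policy(obs)               # forward pass on the source-task observation
    handle.remove()
    return cached["act"]

def inject_activations(policy, layer, obs, source_act):
    """Overwrite the layer's output with cached source activations, then decode actions."""
    def overwrite_hook(module, inputs, output):
        return source_act         # returning a tensor from a forward hook replaces the output

    handle = layer.register_forward_hook(overwrite_hook)
    with torch.no_grad():
        action = policy(obs)      # target-episode forward pass, now steered by source activations
    handle.remove()
    return action
```

Cross-task injection in this framing would cache activations from a source-task episode, inject them while the policy executes a different task, and then measure whether the resulting trajectory drifts toward the source-task position.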

📝 Abstract
Vision-Language-Action (VLA) models combine perception, language, and motor control in a single architecture, yet how they translate multimodal inputs into actions remains poorly understood. We apply activation injection, sparse autoencoders (SAEs), and linear probes to six models spanning 80M–7B parameters across 394,000+ rollout episodes on four benchmarks. The visual pathway dominates action generation across all architectures: injecting baseline activations into null-prompt episodes recovers near-identical behavior, while cross-task injection steers robots toward source-task positions (99.8% of X-VLA episodes align with the source trajectory), exposing spatially bound motor programs tied to scene coordinates rather than abstract task representations. Language sensitivity depends on task structure, not model design: when visual context uniquely specifies the task, language is ignored; when multiple goals share a scene, language becomes essential (X-VLA libero_goal: 94% → 10% under wrong prompts vs. libero_object: 60–100% regardless). In all three multi-pathway architectures (π0.5, SmolVLA, GR00T), expert pathways encode motor programs while VLM pathways encode goal semantics (2× greater behavioral displacement from expert injection), and subspace injection confirms these occupy separable activation subspaces. Per-token SAE processing is essential for action fidelity on most architectures, though mean-pooling improves fidelity on X-VLA. Contrastive identification recovers 82+ manipulation concepts, and causal ablation reveals sensitivity spanning 28–92% zero-effect rates independent of representation width. We release Action Atlas (https://action-atlas.com) for interactive exploration of VLA representations across all six models.
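
To make the per-token vs. mean-pooled SAE comparison concrete, here is a minimal sparse-autoencoder sketch over VLA hidden states. The layer widths, the `SparseAutoencoder` class, and the `reconstruct` helper are illustrative assumptions, not the authors' released code.

```python
# Illustrative sparse autoencoder over VLA hidden states (assumed shapes),
# contrasting per-token encoding with a single mean-pooled summary vector.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model=1024, d_dict=16384):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, h):
        z = torch.relu(self.encoder(h))       # sparse feature activations
        return self.decoder(z), z

def reconstruct(sae, hidden, per_token=True):
    """hidden: (seq_len, d_model) activations from one forward pass."""
    if per_token:
        recon, z = sae(hidden)                 # encode every token separately
    else:
        pooled = hidden.mean(dim=0, keepdim=True)
        recon, z = sae(pooled)                 # one mean-pooled summary vector
        recon = recon.expand_as(hidden)        # broadcast back over the sequence
    return recon, z

# Training typically minimizes reconstruction error plus an L1 sparsity penalty, e.g.:
# loss = ((recon - hidden) ** 2).mean() + l1_coef * z.abs().mean()
```

Action fidelity can then be compared by feeding the reconstructed activations back into the policy (for example, via the injection hooks sketched above) and checking how closely the decoded actions match the unmodified rollout.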
Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action models
multimodal action generation
feature contribution
language sensitivity
motor programs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language-Action Models
Mechanistic Interpretability
Activation Injection
Sparse Autoencoders
Multimodal Subspace Separation
Bryce Grant
Case Western Reserve University
Xijia Zhao
Case Western Reserve University
Peng Wang
School of Computer Science, Northwestern Polytechnical University, China
Computer Vision · Machine Learning · Artificial Intelligence