Not All Features Are Created Equal: A Mechanistic Study of Vision-Language-Action Models

📅 2026-03-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates how vision–language–action (VLA) models translate multimodal inputs into concrete motor actions, probing their internal mechanisms with interpretability techniques (activation injection, sparse autoencoders, linear probing, and causal ablation) applied systematically to six VLA models spanning 80M–7B parameters and 394,000+ rollout episodes. The analysis reveals that action generation is predominantly driven by the visual pathway, with language playing a critical role only when the visual scene does not uniquely specify the task. Motor programs are grounded in scene coordinates rather than abstract task representations, and distinct activation subspaces separately encode motor programs (via expert pathways) and goal semantics (via vision–language model pathways). These findings are validated on four benchmarks, and the authors release Action Atlas, an interactive tool for exploring VLA representations.
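
As a rough illustration of the activation-injection procedure described above, the sketch below uses PyTorch forward hooks to cache a layer's output during a source-task forward pass and overwrite that layer's output during a target-task pass. The `policy`, `layer`, and observation objects are hypothetical placeholders, not the paper's actual interfaces.

```python
# Minimal sketch of activation injection with PyTorch forward hooks.
# `policy`, `layer`, and `obs` are hypothetical; the paper's models
# (e.g. X-VLA, SmolVLA) each expose layers and inputs differently.
import torch

def cache_activations(policy, layer, obs):
    """Run a source-episode step and cache the chosen layer's output."""
    cached = {}

    def save_hook(module, inputs, output):
        cached["act"] = output.detach().clone()

    handle = layer.register_forward_hook(save_hook)
    with torch.no_grad():
        policy(obs)               # forward pass on the source-task observation
    handle.remove()
    return cached["act"]

def inject_activations(policy, layer, obs, source_act):
    """Overwrite the layer's output with cached source activations, then decode actions."""
    def overwrite_hook(module, inputs, output):
        return source_act         # returning a tensor from a forward hook replaces the output

    handle = layer.register_forward_hook(overwrite_hook)
    with torch.no_grad():
        action = policy(obs)      # target-episode forward pass, now steered by source activations
    handle.remove()
    return action
```

Cross-task injection in this framing would cache activations from a source-task episode, inject them while the policy executes a different task, and then measure whether the resulting trajectory drifts toward the source-task position.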

📝 Abstract
Vision-Language-Action (VLA) models combine perception, language, and motor control in a single architecture, yet how they translate multimodal inputs into actions remains poorly understood. We apply activation injection, sparse autoencoders (SAEs), and linear probes to six models spanning 80M–7B parameters across 394,000+ rollout episodes on four benchmarks. The visual pathway dominates action generation across all architectures: injecting baseline activations into null-prompt episodes recovers near-identical behavior, while cross-task injection steers robots toward source-task positions (99.8% of X-VLA episodes align with the source trajectory), exposing spatially bound motor programs tied to scene coordinates rather than abstract task representations. Language sensitivity depends on task structure, not model design: when visual context uniquely specifies the task, language is ignored; when multiple goals share a scene, language becomes essential (X-VLA libero_goal: 94% → 10% under wrong prompts vs. libero_object: 60–100% regardless). In all three multi-pathway architectures (π0.5, SmolVLA, GR00T), expert pathways encode motor programs while VLM pathways encode goal semantics (2× greater behavioral displacement from expert injection), and subspace injection confirms these occupy separable activation subspaces. Per-token SAE processing is essential for action fidelity on most architectures, though mean-pooling improves fidelity on X-VLA. Contrastive identification recovers 82+ manipulation concepts, and causal ablation reveals sensitivity spanning 28–92% zero-effect rates independent of representation width. We release Action Atlas (https://action-atlas.com) for interactive exploration of VLA representations across all six models.
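
To make the per-token vs. mean-pooled SAE comparison concrete, here is a minimal sparse-autoencoder sketch over VLA hidden states. The layer widths, the `SparseAutoencoder` class, and the `reconstruct` helper are illustrative assumptions, not the authors' released code.

```python
# Illustrative sparse autoencoder over VLA hidden states (assumed shapes),
# contrasting per-token encoding with a single mean-pooled summary vector.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model=1024, d_dict=16384):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, h):
        z = torch.relu(self.encoder(h))       # sparse feature activations
        return self.decoder(z), z

def reconstruct(sae, hidden, per_token=True):
    """hidden: (seq_len, d_model) activations from one forward pass."""
    if per_token:
        recon, z = sae(hidden)                 # encode every token separately
    else:
        pooled = hidden.mean(dim=0, keepdim=True)
        recon, z = sae(pooled)                 # one mean-pooled summary vector
        recon = recon.expand_as(hidden)        # broadcast back over the sequence
    return recon, z

# Training typically minimizes reconstruction error plus an L1 sparsity penalty, e.g.:
# loss = ((recon - hidden) ** 2).mean() + l1_coef * z.abs().mean()
```

Action fidelity can then be compared by feeding the reconstructed activations back into the policy (for example, via the injection hooks sketched above) and checking how closely the decoded actions match the unmodified rollout.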
Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action models
multimodal action generation
feature contribution
language sensitivity
motor programs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language-Action Models
Mechanistic Interpretability
Activation Injection
Sparse Autoencoders
Multimodal Subspace Separation
Bryce Grant
Case Western Reserve University
Xijia Zhao
Case Western Reserve University
Peng Wang
School of Computer Science, Northwestern Polytechnical University, China
Computer Vision · Machine Learning · Artificial Intelligence