HyperVLA: Efficient Inference in Vision-Language-Action Models via Hypernetworks

📅 2025-10-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high inference cost of existing Vision-Language-Action (VLA) models and the difficulty of balancing multi-task capability with efficiency, this paper proposes a lightweight VLA architecture based on hypernetworks. Leveraging priors from vision foundation models, the method employs parameter decomposition, hypernetwork-based normalization, and conditional action decoding so that only a small, task-specific policy subnetwork is activated during inference. The model is trained end-to-end on large-scale robotic datasets; at inference, this design substantially reduces computational overhead. Experiments demonstrate that, compared to monolithic VLA models, the approach achieves comparable or superior task success rates in both zero-shot generalization and few-shot adaptation, while reducing the activated parameter count by 90× and increasing inference throughput by 120×. To the best of the authors' knowledge, this is the first work to achieve a favorable accuracy-efficiency trade-off for multi-task VLA models.
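The core mechanism described above, a hypernetwork that emits the weights of a small task-specific policy so that only that policy runs at inference time, can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the single-linear-layer hypernetwork, the two-layer policy, and all dimensions below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (not from the paper):
# task embedding, observation, policy hidden layer, action.
TASK_DIM, OBS_DIM, HID, ACT_DIM = 16, 32, 8, 4

# Total parameter count of the small policy (two linear layers + biases).
n_policy_params = (OBS_DIM * HID + HID) + (HID * ACT_DIM + ACT_DIM)

# Hypernetwork: here just one linear map from task embedding to the
# flattened policy parameters. Only this part carries the large capacity.
W_hn = rng.normal(0, 0.1, (n_policy_params, TASK_DIM))
b_hn = np.zeros(n_policy_params)

def generate_policy(task_emb):
    """Run the hypernetwork once per task to produce policy parameters."""
    flat = W_hn @ task_emb + b_hn
    i = 0
    W1 = flat[i:i + OBS_DIM * HID].reshape(OBS_DIM, HID); i += OBS_DIM * HID
    b1 = flat[i:i + HID]; i += HID
    W2 = flat[i:i + HID * ACT_DIM].reshape(HID, ACT_DIM); i += HID * ACT_DIM
    b2 = flat[i:i + ACT_DIM]
    return W1, b1, W2, b2

def policy_act(params, obs):
    """Cheap inference path: only the small generated policy is evaluated."""
    W1, b1, W2, b2 = params
    h = np.tanh(obs @ W1 + b1)
    return h @ W2 + b2

task_emb = rng.normal(size=TASK_DIM)     # stand-in for a language/vision task encoding
params = generate_policy(task_emb)       # run once per task
action = policy_act(params, rng.normal(size=OBS_DIM))  # per-step cost: ~300 params
print(action.shape)  # (4,)
```

The efficiency argument is visible in the sketch: the hypernetwork is invoked once per task, while every control step touches only the few hundred generated policy parameters rather than the full model.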

📝 Abstract
Built upon language and vision foundation models with strong generalization ability and trained on large-scale robotic data, Vision-Language-Action (VLA) models have recently emerged as a promising approach to learning generalist robotic policies. However, a key drawback of existing VLAs is their extremely high inference costs. In this paper, we propose HyperVLA to address this problem. Unlike existing monolithic VLAs that activate the whole model during both training and inference, HyperVLA uses a novel hypernetwork (HN)-based architecture that activates only a small task-specific policy during inference, while still retaining the high model capacity needed to accommodate diverse multi-task behaviors during training. Successfully training an HN-based VLA is nontrivial, so HyperVLA contains several key algorithm design features that improve its performance, including properly utilizing the prior knowledge from existing vision foundation models, HN normalization, and an action generation strategy. Compared to monolithic VLAs, HyperVLA achieves a similar or even higher success rate for both zero-shot generalization and few-shot adaptation, while significantly reducing inference costs. Compared to OpenVLA, a state-of-the-art VLA model, HyperVLA reduces the number of activated parameters at test time by $90\times$, and accelerates inference speed by $120\times$. Code is publicly available at https://github.com/MasterXiong/HyperVLA
Problem

Research questions and friction points this paper is trying to address.

Reducing high inference costs in Vision-Language-Action models
Activating only task-specific policies during inference for efficiency
Maintaining high model capacity for diverse multi-task behaviors
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hypernetwork architecture for task-specific inference activation
Utilizes vision foundation models prior knowledge effectively
Reduces activated parameters and accelerates inference speed