Characterizing Vision-Language-Action Models across XPUs: Constraints and Acceleration for On-Robot Deployment

📅 2026-04-27

🤖 AI Summary

This work addresses the deployment of Vision-Language-Action (VLA) models on edge robots, where real-time inference must fit within tight cost and energy budgets. Existing evaluations rely mostly on desktop-grade GPUs and overlook heterogeneous edge accelerators. The authors use model-hardware co-characterization to build a cross-accelerator VLA leaderboard, and profiling reveals a two-phase inference pattern: a compute-bound vision-language model (VLM) backbone followed by a memory-bound Action Expert. To address the resulting underutilization, they propose DP-Cache and V-AEFusion, which reduce diffusion redundancy and enable asynchronous pipeline parallelism. Experiments show speedups of up to 2.9× on GPUs and 6× on edge NPUs with only marginal degradation in task success rate, supporting efficient, low-cost VLA deployment on resource-constrained edge platforms.
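The compute-bound versus memory-bound distinction can be illustrated with the standard roofline model, which compares a phase's arithmetic intensity (FLOPs per byte moved) against the hardware's ridge point (peak FLOP/s divided by memory bandwidth). The sketch below is only an illustration of that general idea, not the paper's profiling method. The function names and all numbers (peak throughput, bandwidth, per-phase FLOPs and bytes) are made-up assumptions, not measurements from the paper.

```python
# Minimal roofline sketch. All hardware and workload numbers are
# hypothetical placeholders, not values reported in the paper.

def ridge_point(peak_flops: float, mem_bw_bytes: float) -> float:
    """Arithmetic intensity (FLOP/byte) at which compute and memory limits meet."""
    return peak_flops / mem_bw_bytes


def classify_phase(flops: float, bytes_moved: float,
                   peak_flops: float, mem_bw_bytes: float) -> str:
    """Label a phase by comparing its arithmetic intensity to the ridge point."""
    intensity = flops / bytes_moved
    if intensity >= ridge_point(peak_flops, mem_bw_bytes):
        return "compute-bound"
    return "memory-bound"


def attainable_flops(intensity: float, peak_flops: float,
                     mem_bw_bytes: float) -> float:
    """Roofline upper bound on throughput for a given arithmetic intensity."""
    return min(peak_flops, intensity * mem_bw_bytes)


# Hypothetical accelerator: 100 TFLOP/s peak, 1 TB/s memory bandwidth.
PEAK = 100e12
BW = 1e12

# Hypothetical phases: a large backbone pass with high data reuse, and a
# small action-expert step that mostly streams weights from memory.
backbone = classify_phase(flops=4e12, bytes_moved=8e9,
                          peak_flops=PEAK, mem_bw_bytes=BW)
action_expert = classify_phase(flops=2e10, bytes_moved=1e9,
                               peak_flops=PEAK, mem_bw_bytes=BW)

print(backbone, action_expert)
```

Under these assumed numbers the backbone sits above the ridge point and the action-expert step below it, which is the kind of phase-dependent mismatch the summary describes: no single compute-to-bandwidth ratio suits both phases.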

📝 Abstract

Vision-Language-Action (VLA) models are promising for generalist robot control, but on-robot deployment is bottlenecked by real-time inference under tight cost and energy budgets. Most prior evaluations rely on desktop-grade GPUs, obscuring the trade-offs and opportunities offered by heterogeneous edge accelerators (GPUs/XPUs/NPUs). We present a systematic analysis for low-cost VLA deployment via model-hardware co-characterization. First, we build a cross-accelerator leaderboard and evaluate model-hardware pairs under CET (Cost, Energy, Time), showing that right-sized edge devices can be more cost-/energy-efficient than flagship GPUs while meeting control-rate constraints. Second, using in-depth profiling, we uncover a consistent two-phase inference pattern: a compute-bound VLM backbone followed by a memory-bound Action Expert, which induces phase-dependent underutilization and hardware inefficiency. Finally, guided by these insights, we propose DP-Cache and V-AEFusion to reduce diffusion redundancy and enable asynchronous pipeline parallelism, achieving up to 2.9x speedup on GPUs and 6x on edge NPUs with only marginal success degradation. The example leaderboard website is available at: https://vla-leaderboard-01.vercel.app/.
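To make the CET (Cost, Energy, Time) comparison concrete, one possible way to screen model-hardware pairs is to first discard any pair whose latency misses the control-rate deadline, then rank the remainder by energy per inference and device cost. This is a hedged sketch of that idea, not the leaderboard's actual scoring rule. The `Candidate` type, the ranking order, and every number below are hypothetical assumptions, and treating one inference as one control step is a simplification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A hypothetical model-hardware pair; all fields are illustrative."""
    name: str
    cost_usd: float    # device price
    power_w: float     # average power draw during inference
    latency_s: float   # end-to-end latency of one inference

    @property
    def energy_j(self) -> float:
        # Energy per inference = average power x latency.
        return self.power_w * self.latency_s


def meets_control_rate(c: Candidate, control_hz: float) -> bool:
    """Time constraint: one inference must finish within one control period."""
    return c.latency_s <= 1.0 / control_hz


def rank_by_cet(candidates: list[Candidate], control_hz: float) -> list[Candidate]:
    """Keep pairs that meet the deadline, then sort by energy, then cost."""
    feasible = [c for c in candidates if meets_control_rate(c, control_hz)]
    return sorted(feasible, key=lambda c: (c.energy_j, c.cost_usd))


# Made-up example values, not results from the paper.
CANDIDATES = [
    Candidate("desktop-gpu", cost_usd=1600.0, power_w=300.0, latency_s=0.04),
    Candidate("edge-gpu", cost_usd=500.0, power_w=30.0, latency_s=0.09),
    Candidate("edge-npu", cost_usd=250.0, power_w=10.0, latency_s=0.25),
]

ranked = rank_by_cet(CANDIDATES, control_hz=10.0)
for c in ranked:
    print(f"{c.name}: {c.energy_j:.1f} J/inference, ${c.cost_usd:.0f}")
```

With these assumed values and a 10 Hz control rate, the slowest device is filtered out and the mid-range edge device ranks ahead of the desktop GPU on energy and cost, mirroring the abstract's point that a right-sized device can be more efficient while still meeting the deadline.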

Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action models

on-robot deployment

edge accelerators

real-time inference

cost-energy-time constraints

Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language-Action models

edge accelerators

model-hardware co-characterization