ElegantVLA: Learning When to Think for Efficient Vision-Language-Action Models

📅 2026-05-28
🤖 AI Summary
Existing vision-language-action (VLA) models struggle to meet the low-latency, high-frequency demands of real-time robotic control because they run computationally expensive inference at every control step. This work proposes ElegantVLA, a plug-and-play, phase-adaptive inference framework that introduces, for the first time, a dynamic compute scheduling mechanism driven by the cognitive demands of task phases. At each step, a lightweight scheduler uses vision-language representation stability, motion state, and task progress to select one of five levels of vision-language computation and one of three levels of denoising for action generation, thereby deciding how much prior computation to reuse. The approach jointly accelerates perception, language reasoning, and action generation without modifying or retraining the base model. It achieves up to 2.55× and 3.77× speedup on GR00T and CogACT, respectively; on six real-world GR00T tasks it reduces computation by 2.18× and raises control frequency from 13.8 Hz to 26.3 Hz.
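As a quick sanity check on the reported control frequencies, the snippet below converts them into a per-step time budget. It uses only the two figures quoted above (13.8 Hz and 26.3 Hz); the function name is illustrative and nothing here comes from the paper's code.

```python
def step_budget_ms(control_hz: float) -> float:
    """Time available per control step, in milliseconds, at a given frequency."""
    return 1000.0 / control_hz


baseline_hz = 13.8     # reported control frequency before acceleration
accelerated_hz = 26.3  # reported control frequency with ElegantVLA

baseline_ms = step_budget_ms(baseline_hz)
accelerated_ms = step_budget_ms(accelerated_hz)
frequency_gain = accelerated_hz / baseline_hz

print(f"per-step budget: {baseline_ms:.1f} ms -> {accelerated_ms:.1f} ms")  # 72.5 ms -> 38.0 ms
print(f"control-frequency gain: {frequency_gain:.2f}x")                     # 1.91x
```

The control-frequency gain (about 1.91×) is smaller than the reported 2.18× reduction in computation. The summary does not explain the gap, so no cause is assumed here.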
📝 Abstract
Vision-Language-Action (VLA) models are a powerful paradigm for generalist robotic control. However, their high computational cost and limited control frequency hinder real-time robotic manipulation, especially when large vision-language backbones and iterative action heads run at every control step. Existing VLA acceleration methods often optimize individual components or rely on fixed acceleration rules, applying largely the same computation to every control step and overlooking the non-uniform reasoning demands of sequential embodied control. Inspired by human motor control, where cognitive and feedback resources concentrate on goal-sensitive stages, we argue that VLA models should learn when to invest full computation and when to reuse prior computation. We propose ElegantVLA, a plug-in phase-adaptive inference framework that accelerates VLA models through intra-model dynamic compute scheduling. ElegantVLA introduces a lightweight scheduler that observes temporal representation similarity, robot-motion cues, and episode progress to jointly allocate computation across the vision encoder, LLM, and action head. For perception-language reasoning, the scheduler selects a five-level Vision-LLM compute mode, from full recomputation to multi-step temporal reuse, based on vision-language representation stability. For action generation, it selects a three-level denoising mode, reusing intermediate denoising states during stable motion while preserving full refinement for goal-sensitive stages. By coordinating these decisions, ElegantVLA offers a general acceleration framework for modern VLA pipelines with explicit action-generation modules, without modifying or retraining the base model. Experiments on GR00T and CogACT show up to 2.55× and 3.77× speedup, respectively, and on six real-world GR00T tasks ElegantVLA cuts computation by 2.18× while raising control frequency from 13.8 Hz to 26.3 Hz.
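To make the scheduling idea concrete, here is a minimal sketch of a scheduler that maps the three cues named in the abstract (temporal representation similarity, robot-motion cues, episode progress) to a five-level Vision-LLM mode and a three-level denoising mode. It is an illustration under stated assumptions, not the paper's method: the abstract does not give the scheduler's form, so the fixed thresholds, the warm-up rule, the intermediate mode names, and every identifier below are hypothetical.

```python
import math
from dataclasses import dataclass
from enum import IntEnum


class VLMode(IntEnum):
    """Five Vision-LLM compute levels. Only the two endpoints are named in the
    abstract; the intermediate levels are placeholders."""
    FULL_RECOMPUTE = 0
    LEVEL_1 = 1
    LEVEL_2 = 2
    LEVEL_3 = 3
    MULTI_STEP_REUSE = 4


class DenoiseMode(IntEnum):
    """Three denoising levels (names are hypothetical)."""
    FULL_REFINEMENT = 0
    INTERMEDIATE = 1
    REUSE_STATES = 2


@dataclass
class Cues:
    similarity: float     # cosine similarity between current and cached representations
    motion_change: float  # normalized change in robot motion; 0 means steady (assumed scale)
    progress: float       # fraction of the episode elapsed, in [0, 1]


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Assumed similarity cut-offs: each one passed moves one level toward reuse.
VL_THRESHOLDS = (0.80, 0.90, 0.95, 0.98)


def select_modes(cues, warmup=0.05, steady_motion=0.1):
    """Pick (VLMode, DenoiseMode) from the cues using hand-set rules."""
    # Assumption: nothing useful is cached at the start of an episode.
    if cues.progress < warmup:
        return VLMode.FULL_RECOMPUTE, DenoiseMode.FULL_REFINEMENT

    # More stable representations allow more aggressive temporal reuse.
    vl_mode = VLMode(sum(cues.similarity >= t for t in VL_THRESHOLDS))

    # Treat changing motion as a proxy for a goal-sensitive stage (assumption).
    if cues.motion_change > steady_motion:
        denoise_mode = DenoiseMode.FULL_REFINEMENT
    elif vl_mode >= VLMode.LEVEL_3:
        denoise_mode = DenoiseMode.REUSE_STATES
    else:
        denoise_mode = DenoiseMode.INTERMEDIATE
    return vl_mode, denoise_mode


stable = select_modes(Cues(similarity=0.99, motion_change=0.02, progress=0.5))
changing = select_modes(Cues(similarity=0.85, motion_change=0.5, progress=0.5))
print(stable)
print(changing)
```

In this sketch a stable mid-episode step selects the most aggressive reuse in both modules, while a step with changing motion falls back to full denoising. A scheduler matching the paper's description would presumably replace the hand-set thresholds with learned decisions.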
Problem

Research questions and friction points this paper is trying to address.

vision-language-action models
real-time robotic control
computational efficiency
dynamic compute scheduling
control frequency
Innovation

Methods, ideas, or system contributions that make the work stand out.

dynamic compute scheduling
phase-adaptive inference
vision-language-action models
temporal reuse
efficient robotic control