ActDistill: General Action-Guided Self-Derived Distillation for Efficient Vision-Language-Action Models

📅 2025-11-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current Vision-Language-Action (VLA) models suffer from high computational overhead and inference latency, hindering real-time deployment in robotic manipulation. To address this, we propose an action-guided self-derived distillation framework—the first efficient VLA model compression paradigm explicitly optimized for action execution. Our method introduces an action-prior-driven knowledge distillation mechanism, coupled with a graph-structured encapsulation module and a dynamic routing architecture, enabling precise transfer of the teacher’s action-decision capability to a lightweight student model. It integrates hierarchical supervision, dynamic computation-path selection, and structured knowledge representation. Evaluated on embodied AI benchmarks, the distilled student model matches or surpasses the full-scale teacher in task performance while reducing computational cost by over 50% and accelerating inference by up to 1.67×.

📝 Abstract
Recent Vision-Language-Action (VLA) models have shown impressive flexibility and generalization, yet their deployment in robotic manipulation remains limited by heavy computational overhead and inference latency. In this work, we present ActDistill, a general action-guided self-derived distillation framework that transfers the action prediction capability of any existing VLA model to a lightweight counterpart. Unlike previous efficiency strategies that primarily emphasize vision-language correlations, ActDistill leverages action priors to guide knowledge transfer and model compression, achieving action-oriented efficiency for VLA models. Specifically, we employ a well-trained VLA model as the teacher and introduce a graph-structured encapsulation strategy to explicitly model the hierarchical evolution of action prediction. The student model, derived from the graph-encapsulated teacher, is further equipped with a dynamic router that adaptively selects computation paths based on action prediction demands, guided by hierarchical graph-informed supervision to ensure smooth and efficient evolution. During inference, graph-related auxiliary components are removed, allowing the student to execute only dynamically routed layers and predict high-precision actions with minimal computation and latency. Experiments on embodied benchmarks demonstrate that ActDistill achieves comparable or superior performance to full-scale VLA models while reducing computation by over 50% with up to 1.67 times speedup, thereby establishing a general paradigm toward efficient embodied intelligence.
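The abstract's dynamic router executes "only dynamically routed layers" at inference. As a rough illustration of that general idea (not ActDistill's actual implementation), the sketch below wraps each transformer block in a learned gate that softly scales the block during training and hard-skips it at inference; the pooling, gate design, and threshold are all assumptions for illustration.

```python
# Minimal sketch of dynamic layer routing, assuming a PyTorch transformer
# stack. The gate design, pooling, and threshold are hypothetical choices,
# not ActDistill's implementation.
import torch
import torch.nn as nn


class RoutedBlock(nn.Module):
    """Wraps one transformer block with a learned gate that can skip it."""

    def __init__(self, block: nn.Module, hidden_dim: int):
        super().__init__()
        self.block = block
        self.gate = nn.Linear(hidden_dim, 1)  # per-layer router head

    def forward(self, x: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
        # x: (batch, seq, hidden_dim). Pool tokens and score this layer.
        score = torch.sigmoid(self.gate(x.mean(dim=1)))  # (batch, 1)
        if self.training:
            # Soft gate keeps routing differentiable during distillation.
            return x + score.unsqueeze(-1) * (self.block(x) - x)
        # Hard gate at inference: run the block only for routed samples.
        keep = score.squeeze(-1) > threshold
        out = x.clone()
        if keep.any():
            out[keep] = self.block(x[keep])
        return out
```

A stack of such blocks then executes only the layers the router selects per input, which is what produces the compute and latency savings the abstract reports.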
Problem

Research questions and friction points this paper is trying to address.

Reduces computational overhead in vision-language-action models for robotics
Transfers action prediction capabilities to lightweight models via distillation
Enables efficient inference with dynamic routing and hierarchical supervision
Innovation

Methods, ideas, or system contributions that make the work stand out.

Action-guided distillation transfers VLA capabilities to lightweight models
Graph-structured encapsulation models hierarchical action prediction evolution
Dynamic router selects computation paths based on action demands
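To make the hierarchical-supervision idea in the bullets above concrete, here is a minimal, hypothetical distillation objective combining action-level regression with weighted intermediate feature matching. The loss form, weighting, and layer pairing are illustrative assumptions, not taken from the paper.

```python
# Hypothetical hierarchical distillation objective: the student regresses the
# teacher's action output and, with a smaller weight, matches intermediate
# hidden states at a few aligned layers. All terms here are assumptions.
import torch
import torch.nn.functional as F


def hierarchical_distill_loss(student_action: torch.Tensor,
                              teacher_action: torch.Tensor,
                              student_feats: list[torch.Tensor],
                              teacher_feats: list[torch.Tensor],
                              feat_weight: float = 0.1) -> torch.Tensor:
    # Action-level supervision: match the (detached) teacher action.
    loss = F.mse_loss(student_action, teacher_action.detach())
    # Layer-level supervision: align selected intermediate representations.
    for s_feat, t_feat in zip(student_feats, teacher_feats):
        loss = loss + feat_weight * F.mse_loss(s_feat, t_feat.detach())
    return loss
```

Detaching the teacher tensors keeps gradients flowing only into the student, the standard arrangement in teacher-student distillation.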
🔎 Similar Papers
No similar papers found.
👥 Authors
Wencheng Ye, Tongji University
Tianshi Wang, Tongji University
Lei Zhu, Tongji University
Fengling Li, University of Technology Sydney (Cross-modal Analysis, Domain Adaptation, Multimodal Learning)
Guoli Yang, Advanced Institute of Big Data