Shallow-π: Knowledge Distillation for Flow-based VLAs

📅 2026-01-28
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the lack of efficient deep-compression methods for vision-language-action (VLA) models in real-time robotic deployment. The authors propose Shallow-π, a framework that, for the first time, combines systematic Transformer layer pruning with knowledge distillation to compress both the vision-language model backbone and the flow-based action head, reducing the depth from 18 to 6 layers. Evaluated on edge devices such as Jetson Orin and Jetson Thor, Shallow-π achieves over a 2× speedup in inference latency while incurring less than a 1% drop in task success rate on standard manipulation benchmarks, establishing a new state of the art for compressed VLA models with industrial-grade practicality.

📝 Abstract
The growing demand for real-time robotic deployment necessitates fast and on-device inference for vision-language-action (VLA) models. Within the VLA literature, efficiency has been extensively studied at the token level, such as visual token pruning. In contrast, systematic transformer layer reduction has received limited attention and, to the best of our knowledge, has not been explored for flow-based VLA models under knowledge distillation. In this work, we propose Shallow-π, a principled knowledge distillation framework that aggressively reduces the transformer depth of both the VLM backbone and the flow-based action head, compressing the model from 18 to 6 layers. Shallow-π achieves over two times faster inference with less than one percent absolute drop in success rate on standard manipulation benchmarks, establishing state-of-the-art performance among reduced VLA models. Crucially, we validate our approach through industrial-scale real-world experiments on Jetson Orin and Jetson Thor across multiple robot platforms, including humanoid systems, in complex and dynamic manipulation scenarios.
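The abstract describes two ingredients: pruning the transformer from 18 to 6 layers, and distilling the shallow student against the original teacher. The toy sketch below illustrates that pattern only in outline; the actual architecture, layer-selection rule, and distillation objective are not specified in this summary, so the every-third-layer initialization and the output-matching MSE loss here are illustrative assumptions, not the paper's method.

```python
# Toy sketch of depth-reduction distillation (hypothetical details).
# A "layer" is a scalar affine transform standing in for a transformer
# block; the student keeps every 3rd teacher layer (18 -> 6) and would
# then be trained to match the teacher's outputs.

def make_teacher(num_layers=18):
    # each toy layer slightly scales and shifts its input
    return [(1.0 + 0.01 * i, 0.001 * i) for i in range(num_layers)]

def forward(layers, x):
    for w, b in layers:
        x = w * x + b
    return x

def prune(teacher, keep_every=3):
    # assumed layer-pruning initialization: keep every k-th teacher layer
    return teacher[::keep_every]

def distill_loss(teacher, student, xs):
    # output-matching MSE between teacher and student (assumed objective)
    return sum((forward(teacher, x) - forward(student, x)) ** 2
               for x in xs) / len(xs)

teacher = make_teacher()           # 18 layers
student = prune(teacher)           # 6 layers, as in the paper's setting
loss = distill_loss(teacher, student, [0.0, 0.5, 1.0])
```

Minimizing such a loss over the student's parameters (here fixed for brevity) is the generic distillation step; the paper's contribution, per the abstract, is applying it jointly to the VLM backbone and the flow-based action head.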
Problem

Research questions and friction points this paper is trying to address.

vision-language-action
flow-based VLA
transformer depth reduction
on-device inference
real-time robotic deployment
Innovation

Methods, ideas, or system contributions that make the work stand out.

knowledge distillation
flow-based VLA
transformer depth reduction
on-device inference
real-world robotics