🤖 AI Summary
Existing vision-language-action (VLA) models suffer from excessive parameter counts and high inference latency, rendering them unsuitable for dynamic robotic manipulation tasks requiring millisecond-level responsiveness. This paper proposes a hierarchical Robot Transformer architecture that introduces a novel frequency-decoupled control paradigm: a low-frequency branch leverages knowledge-distilled vision-language models (VLMs) for semantic feature extraction, while a high-frequency branch employs an end-to-end visual policy network for real-time action generation; a cross-frequency feature guidance mechanism further enables joint optimization. The design preserves VLM-level semantic understanding while doubling the effective control frequency. On static tasks, success rates remain unchanged; on dynamic manipulation tasks, success improves significantly—from 48% to 75%. To our knowledge, this is the first VLA framework achieving millisecond-scale dynamic interaction in real-world robotic settings.
📝 Abstract
Large Vision-Language-Action (VLA) models, leveraging powerful pre trained Vision-Language Models (VLMs) backends, have shown promise in robotic control due to their impressive generalization ability. However, the success comes at a cost. Their reliance on VLM backends with billions of parameters leads to high computational costs and inference latency, limiting the testing scenarios to mainly quasi-static tasks and hindering performance in dynamic tasks requiring rapid interactions. To address these limitations, this paper proposes HiRT, a Hierarchical Robot Transformer framework that enables flexible frequency and performance trade-off. HiRT keeps VLMs running at low frequencies to capture temporarily invariant features while enabling real-time interaction through a high-frequency vision-based policy guided by the slowly updated features. Experiment results in both simulation and real-world settings demonstrate significant improvements over baseline methods. Empirically, in static tasks, we double the control frequency and achieve comparable success rates. Additionally, on novel real-world dynamic ma nipulation tasks which are challenging for previous VLA models, HiRT improves the success rate from 48% to 75%.