🤖 AI Summary
Deploying large models on edge robots often incurs high inference latency, disrupting control loops and hindering real-time, safe navigation. To address this challenge, this work proposes AsyncVLA (Asynchronous Vision-Language-Action), a framework that decouples semantic planning, performed by a remote large model, from high-frequency action execution, handled by a lightweight local Edge Adapter. Through end-to-end fine-tuning and a trajectory re-weighting strategy, AsyncVLA bridges the domain gap between high-level semantic guidance and low-level dynamic execution. Evaluated on real-world visual navigation tasks with communication delays of up to six seconds, the method achieves a 40% higher success rate than the current state-of-the-art baseline, substantially improving both real-time responsiveness and system robustness.
📝 Abstract
Robotic foundation models achieve strong generalization by leveraging internet-scale vision-language representations, but their massive computational cost creates a fundamental bottleneck: high inference latency. In dynamic environments, this latency breaks the control loop, rendering powerful models unsafe for real-time deployment. We propose AsyncVLA, an asynchronous control framework that decouples semantic reasoning from reactive execution. Inspired by hierarchical control, AsyncVLA runs a large foundation model on a remote workstation to provide high-level guidance, while a lightweight, onboard Edge Adapter continuously refines actions at high frequency. To bridge the domain gap between these asynchronous streams, we introduce an end-to-end fine-tuning protocol and a trajectory re-weighting strategy that prioritizes dynamic interactions. We evaluate our approach on real-world vision-based navigation tasks with communication delays of up to 6 seconds. AsyncVLA achieves a 40% higher success rate than state-of-the-art baselines, effectively bridging the gap between the semantic intelligence of large models and the reactivity required for edge robotics.
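To make the asynchronous decoupling concrete, here is a minimal sketch of the two loops it implies: a slow remote planning loop that publishes the latest guidance, and a fast onboard loop that keeps acting on fresh observations regardless of network latency. The interfaces `foundation_model.plan`, `edge_adapter.act`, `get_observation`, and `send_action`, and the 30 Hz rate, are illustrative assumptions, not the paper's actual API.

```python
import threading
import time

# Shared state: the most recent guidance published by the remote planner.
latest_guidance = None
guidance_lock = threading.Lock()

def remote_planner(get_observation, foundation_model):
    """Low-frequency loop: query the remote foundation model for guidance.

    Each round trip may take seconds (network + inference latency), so the
    result is published to shared state instead of blocking the control loop.
    """
    global latest_guidance
    while True:
        obs = get_observation()
        guidance = foundation_model.plan(obs)  # slow: remote inference
        with guidance_lock:
            latest_guidance = guidance

def edge_control_loop(get_observation, edge_adapter, send_action, hz=30):
    """High-frequency loop: refine actions onboard using the newest guidance.

    The adapter always acts on a fresh observation; stale guidance degrades
    behavior gracefully rather than stalling the robot.
    """
    period = 1.0 / hz
    while True:
        t0 = time.monotonic()
        obs = get_observation()
        with guidance_lock:
            guidance = latest_guidance
        action = edge_adapter.act(obs, guidance)  # fast: onboard inference
        send_action(action)
        time.sleep(max(0.0, period - (time.monotonic() - t0)))
```

Because the edge loop reads whichever guidance is newest rather than blocking on the planner, a multi-second round trip to the workstation stalls only the semantic updates, not the robot's reactive control.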
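The trajectory re-weighting strategy is described only at a high level; a plausible reading is a weighted imitation loss that up-weights trajectories with more dynamic interaction. The sketch below is an assumption along those lines: `dynamism_score`, a proxy based on mean action change over the horizon, is hypothetical and not the paper's actual criterion.

```python
import torch

def dynamism_score(target_actions):
    """Hypothetical proxy for how dynamic a trajectory is.

    target_actions: (batch, horizon, action_dim). Larger frame-to-frame
    action changes are taken to indicate more dynamic interaction.
    """
    return target_actions.diff(dim=1).abs().mean(dim=(1, 2))

def reweighted_bc_loss(pred_actions, target_actions, alpha=1.0):
    """Behavior-cloning loss with per-trajectory re-weighting."""
    # Per-trajectory mean-squared action error.
    per_traj = ((pred_actions - target_actions) ** 2).mean(dim=(1, 2))
    # Up-weight dynamic trajectories; normalize to keep the loss scale stable.
    weights = 1.0 + alpha * dynamism_score(target_actions)
    weights = weights / weights.mean()
    return (weights * per_traj).mean()
```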