LiteVLA-H: Dual-Rate Vision-Language-Action Inference for Onboard Aerial Guidance and Semantic Perception

📅 2026-04-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Limited onboard computation and communication make it difficult to run vision-language-action (VLA) models on unmanned aerial vehicles, because the model must deliver both low-latency control and semantic perception. This work proposes LiteVLA-H, a compact 256M-parameter VLA system that pairs a dual-rate inference architecture with a tailored scheduler. On an NVIDIA Jetson AGX Orin, it emits action tokens at 19.74 Hz (50.65 ms latency) and produces sentence-level semantic outputs at 6.08–6.67 Hz (149.90–164.57 ms latency). The design rests on the observation that multimodal prefill, not token decoding, dominates end-to-end latency, and on a knowledge-preserving fine-tuning recipe that mixes reactive flight data, aerial semantic data, and generic caption/VQA supervision. Under the authors' deployment conditions, the action branch reaches a higher edge inference rate than recent architectures such as AnywhereVLA, FutureVLA, and ReMem-VLA while retaining semantic description capabilities.
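The reported rates follow directly from the reported latencies (rate in Hz = 1000 / latency in ms). A minimal check, assuming one output per inference pass; the function name `latency_ms_to_hz` is illustrative and not taken from the paper:

```python
def latency_ms_to_hz(latency_ms: float) -> float:
    """Convert a per-inference latency in milliseconds to an output rate in Hz."""
    if latency_ms <= 0:
        raise ValueError("latency must be positive")
    return 1000.0 / latency_ms


# Latencies quoted in the summary and abstract.
action_hz = latency_ms_to_hz(50.65)          # action-token branch
semantic_fast_hz = latency_ms_to_hz(149.90)  # fastest semantic output
semantic_slow_hz = latency_ms_to_hz(164.57)  # slowest semantic output

print(round(action_hz, 2))         # 19.74
print(round(semantic_fast_hz, 2))  # 6.67
print(round(semantic_slow_hz, 2))  # 6.08
```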

📝 Abstract

Vision-language-action (VLA) models have shown strong semantic grounding and task generalization in manipulation, but aerial deployment remains difficult because drones require low-latency closed-loop guidance under strict onboard compute and communication constraints. We present LiteVLA-H, a compact 256M-parameter VLA system designed for dual-rate operation on an NVIDIA Jetson AGX Orin: a fast outer-loop guidance mode for short action-token outputs and a slower semantic mode for scene understanding, hazard description, and operator-facing narration. The central empirical observation is that, in this compact edge regime, end-to-end latency is dominated by multimodal pre-fill rather than by the marginal cost of decoding a few extra tokens. This motivates a scheduler that issues reactive action tokens at 50.65 ms (19.74 Hz) while still supporting sentence-level semantic outputs at 149.90–164.57 ms (6.08–6.67 Hz) on the same embedded platform. To specialize the model without collapsing its descriptive competence, we use a knowledge-preserving fine-tuning recipe that mixes reactive flight data, aerial semantic data, and generic caption/VQA supervision. Beyond reporting current latency measurements, we position the system against recent state-of-the-art architectures, including AnywhereVLA, FutureVLA, and ReMem-VLA, showing that the measured action branch reaches a higher edge inference rate under our deployment conditions while retaining periodic semantic awareness.
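The abstract does not specify how the scheduler interleaves the two modes. As a rough illustration only, the sketch below assumes a strictly serial policy on a single accelerator: one semantic pass is inserted after every `semantic_every` action passes. The policy, the names `Tick`, `simulate_schedule`, and `worst_action_gap_ms`, and the choice of the slower semantic latency are all assumptions made here; only the two latency constants come from the abstract.

```python
from dataclasses import dataclass

ACTION_MS = 50.65     # reported action-token latency
SEMANTIC_MS = 164.57  # reported upper bound on semantic-output latency


@dataclass
class Tick:
    start_ms: float  # time at which this inference pass starts
    mode: str        # "action" or "semantic"


def simulate_schedule(n_steps: int, semantic_every: int) -> list[Tick]:
    """Hypothetical serial schedule: one semantic pass after every
    `semantic_every` action passes, all sharing one device."""
    ticks: list[Tick] = []
    t = 0.0
    actions_since_semantic = 0
    for _ in range(n_steps):
        if actions_since_semantic == semantic_every:
            ticks.append(Tick(t, "semantic"))
            t += SEMANTIC_MS
            actions_since_semantic = 0
        else:
            ticks.append(Tick(t, "action"))
            t += ACTION_MS
            actions_since_semantic += 1
    return ticks


def worst_action_gap_ms(ticks: list[Tick]) -> float:
    """Largest interval between the starts of consecutive action passes."""
    starts = [tk.start_ms for tk in ticks if tk.mode == "action"]
    return max(b - a for a, b in zip(starts, starts[1:]))


ticks = simulate_schedule(n_steps=12, semantic_every=5)
print([tk.mode for tk in ticks])
print(f"worst action gap: {worst_action_gap_ms(ticks):.2f} ms")
```

Under this assumed policy, the longest gap between action outputs equals one action pass plus one semantic pass, which is the kind of trade-off a dual-rate scheduler has to bound. A real deployment might instead preempt or overlap the semantic pass.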

Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action

onboard aerial guidance

low-latency inference

semantic perception

edge deployment

Innovation

Methods, ideas, or system contributions that make the work stand out.

dual-rate inference

onboard aerial guidance

vision-language-action model