🤖 AI Summary
End-to-end latency in wireless robotic behavior cloning for 6G distributed machine learning remains prohibitively high, especially under autoregressive control, where action drafts cannot be verified in parallel.
Method: We propose an action-deviation-aware hybrid inference framework that uses the predicted action deviation to decide whether remote verification and correction are necessary, adaptively skipping redundant communication and computation. The method integrates 6G ultra-reliable low-latency communication (URLLC), edge-cloud collaboration, a draft-target dual-model architecture, and path-deviation threshold optimization, thereby overcoming the parallelism limitations that conventional speculative decoding faces in autoregressive control scenarios.
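The skip decision can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `should_verify`, the Euclidean deviation metric, and the threshold value are all assumptions, since the excerpt does not specify the exact deviation measure.

```python
import math

def should_verify(draft_action, prev_action, threshold):
    """Decide whether a draft action needs remote verification.

    Uses the deviation between consecutive actions as a proxy for the
    target model's rejection probability (hypothetical metric): large
    deviation -> send to server; small deviation -> accept draft locally
    and skip the uplink transmission and server computation.
    """
    deviation = math.dist(draft_action, prev_action)  # Euclidean distance
    return deviation > threshold
```

In this sketch, lowering the threshold verifies more actions (higher fidelity to the target model, more communication), while raising it skips more round-trips at some cost in task success rate, which is the trade-off the paper's path-deviation threshold optimization balances.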
Results: Experiments demonstrate a 40% reduction in uplink transmission and server computation overhead, a 33.32% decrease in end-to-end latency relative to hybrid inference without skipping, and a task success rate reaching up to 97.03% of that achieved by target-model-only inference.
📝 Abstract
To support latency-sensitive AI applications ranging from autonomous driving to industrial robot manipulation, 6G envisions distributed ML, connecting distributed computational resources at the edge and in the cloud over hyper-reliable low-latency communication (HRLLC). In this setting, speculative decoding can facilitate collaborative inference across distributively deployed models: an on-device draft model locally generates drafts, and a remote server-based target model verifies and corrects them, resulting in lower latency. However, unlike autoregressive text generation, behavior cloning policies, typically used for embodied AI applications like robot manipulation and autonomous driving, cannot parallelize verification and correction across multiple drafts, as each action depends on an observation that must be updated by the previous action. To this end, we propose Action Deviation-Aware Hybrid Inference, wherein the draft model estimates an action's need for verification and correction by the target model and selectively skips the communication and computation for server operations. Action deviation shows a strong correlation with an action's rejection probability by the target model, enabling selective skipping. We derive the path deviation threshold that balances the transmission rate against inference performance, and we empirically show that action deviation-aware hybrid inference reduces uplink transmission and server operation by 40%, while lowering end-to-end latency by 33.32% relative to hybrid inference without skipping and achieving a task success rate up to 97.03% of that of target-model-only inference.
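The sequential structure described above, where each action must be executed before the next observation exists, can be sketched as a control loop. This is an illustrative sketch under stated assumptions, not the paper's code: `draft_model`, `target_verify`, `env_step`, and the Euclidean deviation test are hypothetical stand-ins.

```python
import math

def hybrid_inference(obs, steps, draft_model, target_verify, env_step, threshold):
    """Sequential hybrid inference with deviation-aware skipping.

    Unlike speculative decoding for text, drafts cannot be batched for
    parallel verification: each observation depends on executing the
    previous action, so the loop is inherently step-by-step.
    """
    prev_action, actions = None, []
    for _ in range(steps):
        a = draft_model(obs)  # local draft on the device
        if prev_action is not None and math.dist(a, prev_action) <= threshold:
            pass  # low predicted deviation: accept draft, skip uplink + server work
        else:
            a = target_verify(obs, a)  # remote verification and correction
        actions.append(a)
        obs = env_step(obs, a)  # action updates the observation
        prev_action = a
    return actions
```

Every skipped iteration saves one uplink transmission and one target-model forward pass, which is the source of the reported 40% reduction in server operations.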