🤖 AI Summary
This study addresses the severe latency challenges of deploying Vision-Language-Action (VLA) models on edge devices, which hinder real-time performance. Focusing on the MolmoAct-7B model, we conduct a systematic performance characterization on the NVIDIA Jetson Orin and Thor platforms and identify, for the first time, that the action-generation phase constitutes a memory-bound bottleneck, accounting for up to 75% of end-to-end latency. Using analytical modeling and simulation-based projection, we evaluate the potential of high-bandwidth memory (HBM) and processing-in-memory (PIM) architectures to support future VLA models with tens of billions of parameters, quantifying the hardware capabilities required for next-generation edge AI systems.
📝 Abstract
Vision-Language-Action (VLA) models are an emerging class of workloads critical for robotics and embodied AI at the edge. As these models scale, they demonstrate significant capability gains, yet they must be deployed locally to meet the strict latency requirements of real-time applications. This paper characterizes VLA performance on two generations of edge hardware, namely the NVIDIA Jetson Orin and Thor platforms. Using MolmoAct-7B, a state-of-the-art VLA model, we identify a primary execution bottleneck: up to 75% of end-to-end latency is consumed by the memory-bound action-generation phase. Through analytical modeling and simulations, we project the hardware requirements for scaling to 100B-parameter models. We also explore high-bandwidth memory technologies and processing-in-memory (PIM) as promising pathways for future edge systems serving embodied AI.
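The projection described in the abstract rests on a first-order, bandwidth-bound view of autoregressive action generation: when decode is memory-bound, each generated token must stream the model weights from DRAM, so latency scales with model size over memory bandwidth. The sketch below illustrates that style of analytical model only; the function name, byte counts, bandwidth figure, and token count are illustrative assumptions, not the paper's exact parameters or results.

```python
# Minimal sketch of a bandwidth-bound decode-latency estimate (illustrative only).
# Assumptions (not from the paper): weight traffic dominates, FP16 weights,
# and each generated action token requires one full pass over the weights.

def decode_latency_s(n_params: float, bytes_per_param: float,
                     mem_bw_gbps: float, n_tokens: int) -> float:
    """Estimate action-generation latency when decode is memory-bound:
    every token reads all model weights once from main memory."""
    bytes_per_token = n_params * bytes_per_param          # bytes moved per token
    return n_tokens * bytes_per_token / (mem_bw_gbps * 1e9)

# Illustrative numbers: a 7B-parameter model in FP16, ~200 GB/s of memory
# bandwidth (roughly Jetson-class), and 64 generated tokens per action step.
print(f"{decode_latency_s(7e9, 2, 200, 64):.2f} s")  # -> 4.48 s for this example
```

Under this simple model, scaling parameters by roughly 10x at fixed bandwidth scales decode latency by the same factor, which is why the abstract points to higher-bandwidth memory and PIM as the levers for larger edge VLA models.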