🤖 AI Summary
Existing vision-language-action models rely on discrete waypoint prediction, which struggles to capture the continuity of physical motion: sampling resolution is capped, higher-order derivatives are unavailable, and quantization introduces artifacts. This work proposes the Neural Implicit Action Field (NIAF), which, for the first time, formulates action representation as a continuous, differentiable function. NIAF uses a multimodal large language model (MLLM) as a hierarchical spectral modulator over a learnable motion prior, yielding trajectories that can be sampled at effectively infinite resolution. Because the representation is analytically differentiable, the framework supports explicit supervision of velocity, acceleration, and jerk, seamlessly coupling semantic understanding with dynamic execution. It attains state-of-the-art performance on the CALVIN and LIBERO benchmarks and demonstrates robust impedance control in real-world robotic experiments.
📝 Abstract
Despite the rapid progress of Vision-Language-Action (VLA) models, the prevailing paradigm of predicting discrete waypoints remains fundamentally misaligned with the intrinsic continuity of physical motion. This discretization imposes rigid sampling rates, lacks high-order differentiability, and introduces quantization artifacts that hinder precise, compliant interaction. We propose Neural Implicit Action Fields (NIAF), a paradigm shift that reformulates action prediction from discrete waypoint regression to continuous action function regression. By employing a multimodal large language model (MLLM) as a hierarchical spectral modulator over a learnable motion prior, NIAF synthesizes infinite-resolution trajectories as continuous-time manifolds. This formulation enables analytical differentiability, allowing for explicit supervision of velocity, acceleration, and jerk to ensure mathematical consistency and physical plausibility. Our approach achieves state-of-the-art results on the CALVIN and LIBERO benchmarks across diverse backbones. Furthermore, real-world experiments demonstrate that NIAF enables stable impedance control, bridging the gap between high-level semantic understanding and low-level dynamic execution.
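To make the core idea concrete, here is a minimal, self-contained sketch of a continuous action field. It is not the paper's implementation: the spectral coefficients below are fixed constants, whereas in NIAF they would be produced by the MLLM conditioned on vision and language, and the learnable motion prior is omitted. The sketch only illustrates two claims from the abstract: the trajectory is a continuous function of time (so it can be queried at any resolution, with no waypoint grid), and its velocity, acceleration, and jerk are available in closed form for supervision.

```python
import math

def make_action_field(coeffs, base_freq=0.5):
    """Continuous-time 1-D trajectory as a truncated sine series.

    `coeffs` plays the role of the spectral modulation (here hand-picked,
    not MLLM-predicted). Returns closed-form position and its first three
    time derivatives, so velocity/acceleration/jerk losses need no
    numerical differentiation.
    """
    # Precompute (amplitude, angular frequency) per harmonic.
    terms = [(a, 2.0 * math.pi * base_freq * (k + 1))
             for k, a in enumerate(coeffs)]

    def position(t):
        return sum(a * math.sin(w * t) for a, w in terms)

    def velocity(t):  # first analytic derivative
        return sum(a * w * math.cos(w * t) for a, w in terms)

    def acceleration(t):  # second analytic derivative
        return sum(-a * w * w * math.sin(w * t) for a, w in terms)

    def jerk(t):  # third analytic derivative
        return sum(-a * w ** 3 * math.cos(w * t) for a, w in terms)

    return position, velocity, acceleration, jerk

pos, vel, acc, jrk = make_action_field([0.8, 0.15, 0.05])

# Sample at arbitrary resolution -- no fixed waypoint grid is involved.
fine = [pos(i / 1000.0) for i in range(1001)]

# The analytic velocity agrees with a central finite-difference estimate.
h = 1e-6
fd_vel = (pos(0.3 + h) - pos(0.3 - h)) / (2 * h)
print(abs(fd_vel - vel(0.3)) < 1e-6)
```

In an actual VLA pipeline the derivative functions would feed the explicit velocity/acceleration/jerk losses, and a downstream impedance controller could query `pos`/`vel` at its own control rate rather than interpolating between waypoints.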