Neural Implicit Action Fields: From Discrete Waypoints to Continuous Functions for Vision-Language-Action Models

📅 2026-03-02
🤖 AI Summary
Existing vision-language-action models rely on discrete waypoint prediction, which struggles to capture the continuity of physical motion, leading to limited sampling resolution, lack of higher-order differentiability, and quantization artifacts. This work proposes the Neural Implicit Action Field (NIAF), which, for the first time, formulates action representation as a continuous and differentiable function. NIAF leverages a multimodal large language model as a hierarchical spectral modulator to generate trajectories of effectively infinite resolution atop a learnable motion prior. The framework enables explicit supervision over velocity, acceleration, and jerk, thereby achieving seamless integration between semantic understanding and dynamic execution. It attains state-of-the-art performance on the CALVIN and LIBERO benchmarks and demonstrates robust impedance control in real-world robotic experiments.

📝 Abstract
Despite the rapid progress of Vision-Language-Action (VLA) models, the prevailing paradigm of predicting discrete waypoints remains fundamentally misaligned with the intrinsic continuity of physical motion. This discretization imposes rigid sampling rates, lacks high-order differentiability, and introduces quantization artifacts that hinder precise, compliant interaction. We propose Neural Implicit Action Fields (NIAF), a paradigm shift that reformulates action prediction from discrete waypoints to continuous action function regression. By utilizing a multimodal large language model (MLLM) as a hierarchical spectral modulator over a learnable motion prior, NIAF synthesizes infinite-resolution trajectories as continuous-time manifolds. This formulation enables analytical differentiability, allowing for explicit supervision of velocity, acceleration, and jerk to ensure mathematical consistency and physical plausibility. Our approach achieves state-of-the-art results on the CALVIN and LIBERO benchmarks across diverse backbones. Furthermore, real-world experiments demonstrate that NIAF enables stable impedance control, bridging the gap between high-level semantic understanding and low-level dynamic execution.
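The abstract's core idea — an action trajectory represented as a continuous function that can be queried at any time and differentiated in closed form for velocity/acceleration/jerk supervision — can be illustrated with a minimal single-dimension sketch. All names and the fixed sinusoidal basis below are illustrative assumptions, not the paper's architecture; in NIAF the spectral coefficients would be produced by the MLLM on top of a learnable motion prior.

```python
import math

class ImplicitActionField:
    """Toy continuous action field for one action dimension (illustrative).

    The trajectory is a small spectral expansion a(t) = sum_k c_k * sin(2*pi*f_k*t + p_k),
    so it can be sampled at arbitrary t (no fixed waypoint rate) and its
    derivatives of any order exist in closed form.
    """

    def __init__(self, coeffs, freqs, phases):
        # Per-band spectral parameters; in the paper these would be predicted,
        # here they are fixed constants for the sketch.
        self.coeffs = coeffs
        self.freqs = freqs
        self.phases = phases

    def position(self, t):
        # a(t): continuous-time action value, queryable at any real t.
        return sum(c * math.sin(2 * math.pi * f * t + p)
                   for c, f, p in zip(self.coeffs, self.freqs, self.phases))

    def derivative(self, t, order=1):
        # Closed-form n-th derivative: differentiating sin(w*t + p) n times
        # multiplies by w**n and shifts the phase by n*pi/2, so velocity
        # (order=1), acceleration (2), and jerk (3) can be supervised directly
        # without finite differences.
        return sum(c * (2 * math.pi * f) ** order
                   * math.sin(2 * math.pi * f * t + p + order * math.pi / 2)
                   for c, f, p in zip(self.coeffs, self.freqs, self.phases))

# Query the field at an arbitrary time instant.
field = ImplicitActionField(coeffs=[0.5, 0.1], freqs=[1.0, 3.0], phases=[0.0, 0.0])
pos = field.position(0.25)          # position at t = 0.25
vel = field.derivative(0.25, 1)     # analytic velocity
jerk = field.derivative(0.25, 3)    # analytic jerk
```

A training loss in this setting could penalize not only position error but also the analytic velocity/acceleration/jerk terms, which is what gives the trajectory its smoothness guarantees.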
Problem

Research questions and friction points this paper is trying to address.

Tags: Vision-Language-Action models, discrete waypoints, continuous motion, quantization artifacts, physical plausibility
Innovation

Methods, ideas, or system contributions that make the work stand out.

Tags: Neural Implicit Action Fields, continuous action representation, vision-language-action models, analytical differentiability, motion prior
Authors

Haoyun Liu — School of Computer Science, Nanjing University
Jianzhuang Zhao — Istituto Italiano di Tecnologia (mobile manipulation, robot learning, human-robot collaboration)
Xinyuan Chang — Xi'an Jiaotong University; Alibaba-Amap (autonomous driving, computer vision)
Tianle Shi — Shenzhen University of Advanced Technology
Chuanzhang Meng — Shenzhen University of Advanced Technology
Jiayuan Tan — Shenzhen University of Advanced Technology
Feng Xiong — Alibaba (computer vision)
Tong Lin — Xi'an Jiaotong University
Dongjie Huo — Beijing University of Chemical Technology
Mu Xu — Alibaba (CV, LLM, VLM, VLA, RL)
SongLin Dong — Shenzhen University of Advanced Technology
Zhiheng Ma — Shenzhen University of Advanced Technology
Yihong Gong — Xi'an Jiaotong University (multimedia content analysis, machine learning, pattern recognition)
Sheng Zhong — Nanjing University (computer networks, security and privacy, theory of computing)