🤖 AI Summary
Action noise in human demonstrations, such as jitter and pauses, degrades trajectory coherence in flow-matching-based vision-language-action (VLA) models, leading to deployment instability and failures in fine-grained manipulation. To address this, we propose a **training-free, test-time action coherence guidance method** that dynamically refines action sequences during inference to improve smoothness and temporal consistency, substantially increasing robustness to demonstration noise. Our approach is framework-agnostic, integrating with both diffusion and flow-matching VLA architectures without introducing additional parameters or training overhead. We evaluate it on RoboCasa, DexMimicGen, and real-world SO-101 tasks, demonstrating substantial improvements in action coherence metrics and task success rates. The method provides a lightweight, general-purpose, plug-and-play stability enhancement for practical VLA deployment.
📝 Abstract
Diffusion and flow matching models have emerged as powerful robot policies, enabling Vision-Language-Action (VLA) models to generalize across diverse scenes and instructions. Yet, when trained via imitation learning, their high generative capacity makes them sensitive to noise in human demonstrations: jerks, pauses, and jitter that reduce action coherence. Reduced action coherence causes instability and trajectory drift during deployment, failures that are catastrophic in fine-grained manipulation where precision is crucial. In this paper, we present Action Coherence Guidance (ACG) for VLA models, a training-free, test-time guidance algorithm that improves action coherence and thereby yields performance gains. Evaluated on RoboCasa, DexMimicGen, and real-world SO-101 tasks, ACG consistently improves action coherence and boosts success rates across diverse manipulation tasks. Code and project page are available at https://github.com/DAVIAN-Robotics/ACG and https://DAVIAN-Robotics.github.io/ACG , respectively.
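The abstract does not spell out ACG's update rule, so the following is only a hypothetical sketch of what test-time coherence guidance can look like in general: during each Euler step of a flow-matching sampler, the action chunk is nudged down the gradient of a temporal-smoothness penalty. Every name here (`toy_velocity_field`, `guidance_scale`, the penalty itself) is an illustrative assumption, not the authors' implementation.

```python
import numpy as np

def smoothness_grad(actions):
    """Gradient of sum_t ||a_{t+1} - a_t||^2 w.r.t. actions (shape [T, D])."""
    g = np.zeros_like(actions)
    diff = actions[1:] - actions[:-1]            # [T-1, D]
    g[:-1] -= 2.0 * diff                         # d/da_t     of ||a_{t+1} - a_t||^2
    g[1:] += 2.0 * diff                          # d/da_{t+1} of ||a_{t+1} - a_t||^2
    return g

def guided_euler_sample(velocity_field, x0, n_steps=50, guidance_scale=0.1):
    """Euler integration of a flow-matching ODE with an extra guidance term
    that pushes the sampled action chunk toward temporal smoothness."""
    x = x0.copy()
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt
        v = velocity_field(x, t)                      # base policy velocity
        v = v - guidance_scale * smoothness_grad(x)   # coherence guidance (assumed form)
        x = x + dt * v
    return x

# Toy "policy": a velocity field that pulls the sample toward a jittery
# target trajectory, mimicking noise learned from human demonstrations.
rng = np.random.default_rng(0)
target = np.cumsum(rng.normal(size=(16, 2)), axis=0) \
         + rng.normal(scale=0.5, size=(16, 2))

def toy_velocity_field(x, t):
    return target - x

x0 = rng.normal(size=(16, 2))
plain = guided_euler_sample(toy_velocity_field, x0, guidance_scale=0.0)
guided = guided_euler_sample(toy_velocity_field, x0, guidance_scale=0.5)

def total_variation(a):
    """Sum of step-to-step action changes; lower means more coherent."""
    return float(np.sum(np.linalg.norm(a[1:] - a[:-1], axis=-1)))
```

Because the guidance term only adds a gradient to the velocity, it needs no extra parameters or training, which matches the plug-and-play property the abstract claims: the guided trajectory ends up with lower total variation than the unguided one while still tracking the same target.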