🤖 AI Summary
Existing generative visuomotor policies rely on multi-step sampling, which incurs high inference latency and fails to meet real-time robotic manipulation requirements. The core challenge is that action trajectories demand strong temporal continuity and structural consistency, properties that image-generation acceleration techniques cannot directly transfer. This paper proposes FreqPolicy, the first frequency-consistent modeling paradigm for flow-based visuomotor policies, introducing a temporal frequency-domain consistency constraint and an adaptive consistency loss to explicitly model trajectory continuity. The method combines frequency-domain feature alignment, an adaptively weighted loss, and end-to-end Vision-Language-Action (VLA) integration. Evaluated on 53 simulated tasks, it surpasses state-of-the-art one-step action generators. When integrated into a VLA framework, it accelerates inference on the 40 Libero tasks with no performance degradation, and on physical hardware it runs at 93.5 Hz.
📝 Abstract
Generative modeling-based visuomotor policies have been widely adopted in robotic manipulation owing to their ability to model multimodal action distributions. However, the high inference cost of multi-step sampling limits their applicability in real-time robotic systems. To address this issue, existing approaches accelerate sampling by adapting acceleration techniques originally developed for image generation. A major distinction remains, however: image generation typically produces independent samples without temporal dependencies, whereas robotic manipulation generates time-series action trajectories that require continuity and temporal coherence. To exploit this temporal information, we propose FreqPolicy, the first approach to impose frequency consistency constraints on flow-based visuomotor policies. FreqPolicy enables the action model to capture temporal structure effectively while supporting efficient, high-quality one-step action generation. We introduce a frequency consistency constraint that aligns frequency-domain action features across different timesteps along the flow, thereby promoting convergence of one-step action generation toward the target distribution. In addition, we design an adaptive consistency loss to capture the structural temporal variations inherent in robotic manipulation tasks. We evaluate FreqPolicy on 53 tasks across three simulation benchmarks, demonstrating its superiority over existing one-step action generators. We further integrate FreqPolicy into a vision-language-action (VLA) model and achieve acceleration without performance degradation on the 40 tasks of Libero. Finally, we show efficiency and effectiveness in real-world robotic scenarios at an inference frequency of 93.5 Hz. The code will be publicly available.
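The frequency consistency constraint described above can be illustrated as a spectral alignment loss between two predicted action trajectories (e.g. a one-step prediction and a multi-step flow prediction). The sketch below is an assumption-laden illustration, not the authors' implementation: the function name, the uniform default weights standing in for the adaptive weighting, and the (T, D) trajectory layout are all hypothetical.

```python
import numpy as np

def frequency_consistency_loss(actions_a, actions_b, weights=None):
    """Illustrative sketch of a frequency-domain consistency loss.

    actions_a, actions_b: (T, D) arrays, two action trajectories whose
    frequency-domain features should align (e.g. predictions at
    different timesteps along the flow).
    weights: optional (T//2 + 1,) per-frequency weights, a stand-in
    for the paper's adaptive weighting; defaults to uniform.
    """
    # Real FFT along the time axis -> (T//2 + 1, D) complex spectra.
    spec_a = np.fft.rfft(actions_a, axis=0)
    spec_b = np.fft.rfft(actions_b, axis=0)
    if weights is None:
        weights = np.ones(spec_a.shape[0])
    # Weighted mean squared error between the complex spectra.
    diff = np.abs(spec_a - spec_b) ** 2  # (F, D)
    return float(np.mean(weights[:, None] * diff))
```

Aligning spectra rather than raw waypoints penalizes disagreement in trajectory smoothness and periodic structure directly, which is one plausible way a frequency-domain constraint could encourage one-step outputs to match the temporal coherence of the target distribution.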