Hydra-DP3: Frequency-Aware Right-Sizing of 3D Diffusion Policies for Visuomotor Control

📅 2026-05-02
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the redundancy and high inference latency of existing diffusion-based visuomotor policies, which rely on image-generation-style decoders and multi-step sampling. For the first time, it introduces frequency-domain analysis into diffusion policy design and observes that robot action trajectories concentrate their energy in a few low-frequency Discrete Cosine Transform (DCT) modes. Leveraging this insight, the authors propose a pocket-scale 3D diffusion policy that pairs a lightweight Diffusion Mixer decoder with two-step DDIM sampling. The method achieves state-of-the-art performance across RoboTwin2.0, Adroit, MetaWorld, and real-robot tasks while using fewer than 1% of the parameters of prior 3D diffusion-based policies and substantially reducing inference latency.
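To make "two-step DDIM sampling" concrete, the following is a minimal sketch of a deterministic DDIM loop (eta = 0) over a short noise schedule. It is not the paper's implementation: the function name `ddim_sample`, the schedule values, and the constant toy denoiser are all assumptions chosen for illustration, and the denoiser is assumed to predict the clean action sequence directly.

```python
import math
import random


def ddim_sample(denoiser, x_start, alpha_bars):
    """Deterministic DDIM sampling (eta = 0).

    denoiser(x, alpha_bar) is assumed to return an estimate of the clean sample.
    alpha_bars lists cumulative signal levels from noisiest to cleanest;
    every entry except the last must be strictly below 1.0.
    """
    x = list(x_start)
    for a_t, a_next in zip(alpha_bars[:-1], alpha_bars[1:]):
        x0_hat = denoiser(x, a_t)
        # Noise implied by the current sample and the clean-sample estimate.
        eps_hat = [
            (xi - math.sqrt(a_t) * x0i) / math.sqrt(1.0 - a_t)
            for xi, x0i in zip(x, x0_hat)
        ]
        # Move to the next (less noisy) level along the deterministic path.
        x = [
            math.sqrt(a_next) * x0i + math.sqrt(1.0 - a_next) * ei
            for x0i, ei in zip(x0_hat, eps_hat)
        ]
    return x


# Hypothetical stand-in for a trained policy decoder: always predicts one target.
target = [0.0, 0.1, 0.3, 0.6, 0.8, 0.9, 1.0, 1.0]


def toy_denoiser(x, alpha_bar):
    return list(target)


random.seed(0)
noise = [random.gauss(0.0, 1.0) for _ in target]

# Three noise levels give exactly two reverse steps.
schedule = [0.01, 0.5, 1.0]
sample = ddim_sample(toy_denoiser, noise, schedule)
```

Because the last schedule entry is 1.0, the final update returns the denoiser's clean estimate with no residual noise, so the quality of a two-step sampler rests almost entirely on how good that estimate is after a single refinement.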
📝 Abstract
Diffusion-based visuomotor policies perform well in robotic manipulation, yet current methods still inherit image-generation-style decoders and multi-step sampling. We revisit this design from a frequency-domain perspective. Robot action trajectories are highly smooth, with most energy concentrated in a few low-frequency discrete cosine transform (DCT) modes. Under this structure, we show that the error of the optimal denoiser is bounded by the low-frequency subspace dimension and residual high-frequency energy, implying that denoising error saturates after very few reverse steps. This further suggests that action denoising requires a much simpler denoising model than image generation. Motivated by this insight, we propose Hydra-DP3 (HDP3), a pocket-scale 3D diffusion policy with a lightweight Diffusion Mixer decoder that supports two-step DDIM inference. Our synthetic experiments validate the theory and support the sufficiency of two-step denoising. Furthermore, across RoboTwin2.0, Adroit, MetaWorld, and real-world tasks, HDP3 achieves state-of-the-art performance with fewer than 1% of the parameters of prior 3D diffusion-based policies and substantially lower inference latency.
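The claim that smooth action trajectories concentrate their energy in a few low-frequency DCT modes can be checked on a toy signal. The sketch below is illustrative only: the minimum-jerk-style trajectory, the horizon of 16 steps, and the cutoff of 4 modes are assumptions, not values taken from the paper.

```python
import math


def dct2_orthonormal(x):
    """Orthonormal DCT-II of a 1D sequence (direct O(n^2) evaluation)."""
    n = len(x)
    coeffs = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        coeffs.append(scale * s)
    return coeffs


def low_freq_energy_fraction(x, num_modes):
    """Fraction of signal energy carried by the first num_modes DCT modes."""
    coeffs = dct2_orthonormal(x)
    total = sum(c * c for c in coeffs)
    return sum(c * c for c in coeffs[:num_modes]) / total


# Hypothetical smooth 1D action trajectory: a minimum-jerk-style ramp from 0 to 1.
horizon = 16
trajectory = []
for i in range(horizon):
    s = i / (horizon - 1)
    trajectory.append(10 * s**3 - 15 * s**4 + 6 * s**5)

coeffs = dct2_orthonormal(trajectory)
fraction = low_freq_energy_fraction(trajectory, 4)
print(f"energy in first 4 of {horizon} DCT modes: {fraction:.4f}")
```

Because the transform is orthonormal, the squared coefficients sum to the signal's energy, so the printed fraction is a direct measure of how much of the trajectory a low-frequency subspace of that size can represent.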
Problem

Research questions and friction points this paper is trying to address.

diffusion policy
visuomotor control
frequency-aware
right-sizing
robotic manipulation
Innovation

Methods, ideas, or system contributions that make the work stand out.

frequency-aware
diffusion policy
visuomotor control
lightweight decoder
two-step denoising