Ada3Drift: Adaptive Training-Time Drifting for One-Step 3D Visuomotor Robotic Manipulation

📅 2026-03-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing single-step generative visuomotor policies often suffer from trajectory averaging and physically implausible behaviors due to difficulties in modeling multimodal action distributions. This work proposes a drift-field mechanism applied during training that shifts iterative refinement into the learning phase, enabling efficient recovery of multimodal action distributions within a single forward pass. By integrating multi-scale field aggregation with a progressive loss strategy governed by a sigmoid schedule, the method operates within a flow-matching framework to achieve high-fidelity 3D point-cloud-driven manipulation while maintaining real-time inference, requiring only a single function evaluation (10× fewer than diffusion-based alternatives). The approach establishes state-of-the-art performance across both simulated benchmarks (Adroit, Meta-World, RoboTwin) and real-world robotic tasks.
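The attract/repel behavior of such a drift field can be illustrated with a toy sketch. This is not the paper's implementation; the function name, distance kernel, and weights below are all assumptions chosen to show the idea that predicted actions are pulled toward nearby expert demonstrations while being pushed apart from each other to avoid mode collapse:

```python
import numpy as np

def drift_field(pred_actions, expert_actions, temperature=1.0, repel_weight=0.1):
    """Toy drift field (hypothetical): attract each predicted action toward
    its nearest expert demonstration, repel it from other predicted samples."""
    pred = np.asarray(pred_actions, dtype=float)      # (N, D) generated samples
    expert = np.asarray(expert_actions, dtype=float)  # (M, D) expert modes
    # Attraction: pull each sample toward its closest expert demonstration.
    d = pred[:, None, :] - expert[None, :, :]         # (N, M, D)
    dist2 = (d ** 2).sum(-1)                          # (N, M) squared distances
    nearest = expert[dist2.argmin(axis=1)]            # (N, D) closest mode
    attract = nearest - pred
    # Repulsion: push samples apart so distinct modes are not averaged away.
    diff = pred[:, None, :] - pred[None, :, :]        # (N, N, D)
    w = np.exp(-(diff ** 2).sum(-1) / temperature)    # (N, N) proximity kernel
    np.fill_diagonal(w, 0.0)                          # no self-repulsion
    repel = (w[..., None] * diff).sum(axis=1)
    return attract + repel_weight * repel
```

With two predictions sitting between two expert modes, the field points each prediction toward a different mode instead of averaging them, which is the multimodality-preserving behavior the summary describes.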

📝 Abstract
Diffusion-based visuomotor policies effectively capture multimodal action distributions through iterative denoising, but their high inference latency limits real-time robotic control. Recent flow matching and consistency-based methods achieve single-step generation, yet sacrifice the ability to preserve distinct action modes, collapsing multimodal behaviors into averaged, often physically infeasible trajectories. We observe that the compute budget asymmetry in robotics (offline training vs. real-time inference) naturally motivates recovering this multimodal fidelity by shifting iterative refinement from inference time to training time. Building on this insight, we propose Ada3Drift, which learns a training-time drifting field that attracts predicted actions toward expert demonstration modes while repelling them from other generated samples, enabling high-fidelity single-step generation (1 NFE) from 3D point cloud observations. To handle the few-shot robotic regime, Ada3Drift further introduces a sigmoid-scheduled loss transition from coarse distribution learning to mode-sharpening refinement, and multi-scale field aggregation that captures action modes at varying spatial granularities. Experiments on three simulation benchmarks (Adroit, Meta-World, and RoboTwin) and real-world robotic manipulation tasks demonstrate that Ada3Drift achieves state-of-the-art performance while requiring 10× fewer function evaluations than diffusion-based alternatives.
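The "sigmoid-scheduled loss transition" mentioned in the abstract can be sketched as a smooth weight that blends a coarse distribution-learning loss into a mode-sharpening loss over training. The function names, midpoint, and steepness below are assumptions for illustration, not the paper's actual schedule:

```python
import math

def sigmoid_loss_weight(step, total_steps, midpoint=0.5, steepness=10.0):
    # Hypothetical schedule: weight rises from ~0 (coarse phase) to ~1
    # (mode-sharpening phase) as normalized training progress passes midpoint.
    t = step / total_steps
    return 1.0 / (1.0 + math.exp(-steepness * (t - midpoint)))

def combined_loss(coarse_loss, sharpening_loss, step, total_steps):
    # Blend the two objectives: early training emphasizes coarse distribution
    # learning, late training emphasizes mode-sharpening refinement.
    w = sigmoid_loss_weight(step, total_steps)
    return (1.0 - w) * coarse_loss + w * sharpening_loss
```

Early in training the coarse term dominates; by the end the sharpening term does, matching the coarse-to-fine transition the abstract describes.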
Problem

Research questions and friction points this paper is trying to address.

visuomotor policy
multimodal action distribution
real-time robotic control
single-step generation
3D point cloud
Innovation

Methods, ideas, or system contributions that make the work stand out.

training-time drifting
single-step generation
multimodal visuomotor policy
3D point cloud
flow matching
Chongyang Xu
College of Computer Science, Sichuan University, Chengdu, China
Yixian Zou
School of Information and Communication Engineering, University of Electronic Science and Technology of China, Chengdu, China
Ziliang Feng
College of Computer Science, Sichuan University, Chengdu, China
Fanman Meng
School of Information and Communication Engineering, University of Electronic Science and Technology of China, Chengdu, China
Shuaicheng Liu
University of Electronic Science and Technology of China
Computer Vision · Computational Photography