VADF: Vision-Adaptive Diffusion Policy Framework for Efficient Robotic Manipulation

📅 2026-04-17
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses slow training convergence and high inference latency in existing diffusion-based robotic policies, which the authors attribute to uniform sampling and a lack of sample-difficulty awareness. It proposes VADF, described as the first model-agnostic, vision-based adaptive diffusion policy framework, with two key components: an Adaptive Loss Network (ALN), a lightweight MLP that estimates sample difficulty during training and reweights sampling toward hard negatives, and a Hierarchical Vision Task Segmenter (HVTS), which decomposes task instructions from visual input and adapts the noise schedule and action sequence length to each subtask. The approach is reported to accelerate training convergence, reduce computational overhead at inference time, and improve early-stage task success rates.
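The difficulty-based reweighting attributed to ALN can be illustrated with a minimal sketch. It assumes the per-sample losses have already been predicted by some lightweight model; the softmax weighting, the `temperature` parameter, and the function names `difficulty_weights` and `sample_batch` are hypothetical choices made for illustration, not details taken from the paper.

```python
import math
import random


def difficulty_weights(predicted_losses, temperature=1.0):
    """Map predicted per-sample losses to sampling probabilities.

    Assumption: a softmax over predicted losses, so higher-loss
    (harder) samples receive a larger share of the probability mass.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    peak = max(predicted_losses)  # subtract the max for numerical stability
    exps = [math.exp((loss - peak) / temperature) for loss in predicted_losses]
    total = sum(exps)
    return [e / total for e in exps]


def sample_batch(num_samples, weights, batch_size, seed=None):
    """Draw training indices with replacement, biased toward hard samples."""
    rng = random.Random(seed)
    return rng.choices(range(num_samples), weights=weights, k=batch_size)


# Hypothetical predicted losses: sample 2 is the hard one.
predicted = [0.1, 0.1, 2.0, 0.1]
weights = difficulty_weights(predicted)
batch = sample_batch(len(predicted), weights, batch_size=8, seed=0)
```

A lower `temperature` concentrates sampling on the hardest examples, while a very high one approaches uniform sampling, which is the baseline behavior the summary identifies as the source of slow convergence.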

📝 Abstract
Diffusion policies are becoming mainstream in robotic manipulation, but uniform sampling and a lack of sample-difficulty awareness leave them with a hard-negative class imbalance, which leads to slow training convergence and frequent inference timeout failures. We propose VADF (Vision-Adaptive Diffusion Policy Framework), a vision-driven dual-adaptive framework that significantly reduces the number of training steps needed to converge and achieves earlier task success at inference time. Its model-agnostic design allows integration into any diffusion policy architecture. During training, we introduce the Adaptive Loss Network (ALN), a lightweight MLP-based loss predictor that quantifies per-step sample difficulty in real time. Guided by hard negative mining, it performs weighted sampling to prioritize high-loss regions, enabling adaptive weight updates and faster convergence. For inference, we design the Hierarchical Vision Task Segmenter (HVTS), which uses visual input to decompose high-level task instructions into multi-stage low-level sub-instructions. It classifies the resulting subtasks as simple or complex, assigning simple actions shorter noise schedules with longer direct-execution sequences and complex actions longer noise schedules with shorter execution sequences, thereby reducing computational overhead and improving the early success rate.
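The trade-off HVTS exploits at inference time can be made concrete with a small sketch. All numbers here (denoising steps, execution horizons, subtask lengths) and the names `Schedule`, `pick_schedule`, and `denoise_calls` are hypothetical, chosen only to show how pairing short noise schedules with long execution sequences on simple subtasks reduces the number of denoising network calls; the abstract does not give the actual values or the classification rule.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Schedule:
    denoise_steps: int  # denoising iterations per predicted action chunk
    exec_horizon: int   # actions executed before the policy replans


# Hypothetical settings, for illustration only.
SIMPLE = Schedule(denoise_steps=10, exec_horizon=16)
COMPLEX = Schedule(denoise_steps=50, exec_horizon=4)


def pick_schedule(is_complex: bool) -> Schedule:
    """Simple subtasks: few denoising steps, long execution; complex: the reverse."""
    return COMPLEX if is_complex else SIMPLE


def denoise_calls(num_actions: int, schedule: Schedule) -> int:
    """Total denoising network calls needed to cover num_actions actions."""
    chunks = math.ceil(num_actions / schedule.exec_horizon)
    return chunks * schedule.denoise_steps


# (name, number of actions, is_complex): a made-up two-stage task.
subtasks = [("reach", 32, False), ("insert", 8, True)]

adaptive = sum(denoise_calls(n, pick_schedule(c)) for _name, n, c in subtasks)
uniform = sum(denoise_calls(n, COMPLEX) for _name, n, _flag in subtasks)
print(adaptive, uniform)  # 120 500
```

In this toy setup the adaptive plan spends its denoising budget on the complex "insert" stage and covers the simple "reach" stage with two cheap chunks, whereas a uniform schedule pays the full cost everywhere.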
Problem

Research questions and friction points this paper is trying to address.

diffusion policies
class imbalance
robotic manipulation
training convergence
inference timeout
Innovation

Methods, ideas, or system contributions that make the work stand out.

diffusion policy
adaptive sampling
vision-based manipulation
hard negative mining
noise schedule adaptation