🤖 AI Summary
This work addresses the sensitivity of diffusion-policy fine-tuning to noisy human preference labels, which often causes significant performance degradation. The authors propose a unified Markov decision process (MDP) framework that, for the first time, jointly models the multi-step denoising process of a diffusion policy and the environment dynamics, enabling direct preference optimization (DPO) without explicit reward signals. By introducing a conservative clipping mechanism grounded in a geometric view of the DPO objective, the method tolerates up to 30% incorrect preference labels without requiring prior knowledge of the noise distribution. Experiments on long-horizon manipulation tasks demonstrate that the approach substantially outperforms state-of-the-art methods, effectively aligning diverse pre-trained diffusion policies with human intent while maintaining stable performance under high levels of label noise.
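For context, the standard DPO objective that such a reward-free formulation builds on is shown below; the notation ($\beta$, $\pi_{\mathrm{ref}}$, preferred/rejected pair) is generic DPO usage rather than the paper's own. For diffusion policies the action log-likelihoods $\pi_\theta(a \mid s)$ are intractable in closed form, which is precisely the obstacle the unified MDP over the denoising chain is designed to circumvent:

$$
\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(s,\,a^{w},\,a^{l})}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(a^{w}\mid s)}{\pi_{\mathrm{ref}}(a^{w}\mid s)} - \beta \log \frac{\pi_\theta(a^{l}\mid s)}{\pi_{\mathrm{ref}}(a^{l}\mid s)}\right)\right],
$$

where $\pi_{\mathrm{ref}}$ is the frozen pre-trained policy, $(a^{w}, a^{l})$ a preferred/rejected action pair, and $\sigma$ the logistic sigmoid.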
📝 Abstract
Diffusion policies are a powerful paradigm for robotic control, but fine-tuning them with human preferences is fundamentally challenged by the multi-step structure of the denoising process. To overcome this, we introduce a Unified Markov Decision Process (MDP) formulation that coherently integrates the diffusion denoising chain with the environment dynamics, enabling reward-free Direct Preference Optimization (DPO) for diffusion policies. Building on this formulation, we propose RoDiF (Robust Direct Fine-Tuning), a method that explicitly accounts for corrupted human preferences. RoDiF reinterprets the DPO objective through a geometric hypothesis-cutting perspective and employs a conservative cutting strategy to achieve robustness without assuming any specific noise distribution. Extensive experiments on long-horizon manipulation tasks show that RoDiF consistently outperforms state-of-the-art baselines, effectively steering pre-trained diffusion policies of diverse architectures toward human-preferred modes while maintaining strong performance even with 30% of preference labels corrupted.
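A minimal sketch of what a clipped, DPO-style loss over a denoising chain might look like in PyTorch. All names (`denoise_logprob`, `robust_dpo_loss`, `margin_clip`) are illustrative assumptions, and the margin clamp shown here is just one simple way to bound the influence of a possibly mislabeled pair; the paper's geometric hypothesis-cutting strategy may differ in detail.

```python
import torch
import torch.nn.functional as F


def denoise_logprob(policy, obs, chain):
    """Log-likelihood of a denoising chain under `policy`, summed over steps.

    Assumes `policy(obs, x_t, t)` returns the predicted mean of the reverse
    step x_{t-1} | x_t, with a unit-variance Gaussian for brevity.
    """
    logp = 0.0
    for t in range(len(chain) - 1, 0, -1):
        mean = policy(obs, chain[t], t)
        # Gaussian log-density of the realized x_{t-1}, up to a constant.
        logp = logp - 0.5 * ((chain[t - 1] - mean) ** 2).sum(dim=-1)
    return logp


def robust_dpo_loss(policy, ref_policy, obs, chain_w, chain_l,
                    beta=0.1, margin_clip=5.0):
    """DPO loss on one preferred (chain_w) / rejected (chain_l) pair."""
    with torch.no_grad():  # the reference policy stays frozen
        ref_w = denoise_logprob(ref_policy, obs, chain_w)
        ref_l = denoise_logprob(ref_policy, obs, chain_l)
    # Implicit rewards: log-ratio of fine-tuned vs. reference likelihoods.
    r_w = denoise_logprob(policy, obs, chain_w) - ref_w
    r_l = denoise_logprob(policy, obs, chain_l) - ref_l
    # Conservative step: clamp the preference margin so a confidently
    # "wrong" pair (a possibly corrupted label) yields a bounded loss
    # and gradient, rather than dominating the update.
    margin = (beta * (r_w - r_l)).clamp(-margin_clip, margin_clip)
    return -F.logsigmoid(margin).mean()
```

Here `chain_w` and `chain_l` would be full denoising trajectories (x_T, ..., x_0) recorded for the preferred and rejected actions, which the unified MDP formulation additionally interleaves with environment transitions.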