TMRL: Diffusion Timestep-Modulated Pretraining Enables Exploration for Efficient Policy Finetuning

📅 2026-05-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Behavior cloning pretraining often yields narrow action distributions that hinder the exploration needed for effective reinforcement learning fine-tuning. This work proposes a unified framework that uses the diffusion timestep as a learnable control on exploration: during pretraining, forward-diffusion noise is injected into policy inputs to broaden action coverage, and during fine-tuning the agent dynamically adjusts the timestep to control exploration intensity. The approach combines Context-Smoothed Pre-training (CSP) with Timestep-Modulated Reinforcement Learning (TMRL) and supports diverse policy inputs, including states, 3D point clouds, and image-based VLA policies. On real robots, the method fine-tunes complex manipulation tasks in under one hour, and it improves RL fine-tuning sample efficiency.

📝 Abstract

Fine-tuning pre-trained robot policies with reinforcement learning (RL) often inherits the bottlenecks introduced by pre-training with behavioral cloning (BC), which produces narrow action distributions that lack the coverage necessary for downstream exploration. We present a unified framework that bridges BC pre-training and RL fine-tuning to provide the exploration needed for efficient robot policy fine-tuning. Our pre-training method, Context-Smoothed Pre-training (CSP), injects forward-diffusion noise into policy inputs, creating a continuum between precise imitation and broad action coverage. We then fine-tune pre-trained policies via Timestep-Modulated Reinforcement Learning (TMRL), which trains the agent to dynamically adjust this conditioning during fine-tuning by modulating the diffusion timestep, granting explicit control over exploration. TMRL integrates seamlessly with arbitrary policy inputs, e.g., states, 3D point clouds, or image-based VLA policies, and we show that it improves RL fine-tuning sample efficiency. Notably, TMRL enables successful real-world fine-tuning on complex manipulation tasks in under one hour. Videos and code are available at https://weirdlabuw.github.io/tmrl/.
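
To make the idea of injecting forward-diffusion noise into policy inputs concrete, here is a minimal sketch of the standard forward-diffusion noising step applied to a context vector. It is an illustration under stated assumptions, not the authors' implementation: the linear beta schedule, the 1000-step horizon, and the names `make_alpha_bars` and `noise_context` are all hypothetical, and the abstract does not say which schedule or which inputs are noised. The timestep `t` is passed in by hand here, whereas in TMRL the agent would learn to choose it during fine-tuning.

```python
import math
import random


def make_alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule.

    The schedule and its endpoints are assumptions for illustration only.
    """
    alpha_bars = []
    running = 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / max(num_steps - 1, 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def noise_context(context, t, alpha_bars, rng):
    """Forward-diffuse a context vector to timestep t (1-indexed).

    t = 0 returns the context unchanged (precise imitation end of the
    continuum); larger t mixes in more Gaussian noise (broader coverage).
    """
    if t == 0:
        return list(context)
    a_bar = alpha_bars[t - 1]
    signal = math.sqrt(a_bar)
    sigma = math.sqrt(1.0 - a_bar)
    return [signal * c + sigma * rng.gauss(0.0, 1.0) for c in context]


rng = random.Random(0)
alpha_bars = make_alpha_bars()
state = [0.5, -1.0, 2.0]  # hypothetical low-dimensional state input

clean = noise_context(state, 0, alpha_bars, rng)
lightly_noised = noise_context(state, 10, alpha_bars, rng)
heavily_noised = noise_context(state, 1000, alpha_bars, rng)
```

At small `t` the noised context stays close to the original input, while at the largest `t` almost none of the original signal remains. That is the sense in which a single timestep can act as a dial between imitation and exploration.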

Problem

Research questions and friction points this paper is trying to address.

reinforcement learning fine-tuning

behavioral cloning

exploration

policy pretraining

sample efficiency

Innovation

Methods, ideas, or system contributions that make the work stand out.

Diffusion Timestep Modulation

Context-Smoothed Pre-training

Reinforcement Learning Fine-tuning