SOAR: Self-Correction for Optimal Alignment and Refinement in Diffusion Models

📅 2026-04-14
📈 Citations: 0
Influential: 0

🤖 AI Summary
Diffusion models trained via supervised fine-tuning (SFT) are optimized only on ideal states sampled from the forward noising process and have no learned correction mechanism when inference deviates from those states, which leads to exposure bias. Reinforcement learning (RL) approaches, meanwhile, suffer from sparse rewards, credit-assignment difficulty, and reward hacking. To bridge this gap, this work proposes SOAR, which starts from a real sample, performs a single stop-gradient rollout with the current model, re-noises the resulting off-trajectory state, and supervises the model to regress toward the original clean target. SOAR is an on-policy, reward-free self-correction method that provides dense per-timestep supervision, and its base loss subsumes the standard SFT objective. On SD3.5-Medium, SOAR improves over SFT: GenEval rises from 0.70 to 0.78, OCR from 0.64 to 0.67, and all model-based preference scores increase. In controlled reward-specific experiments, SOAR surpasses Flow-GRPO in final metric value on aesthetic and text-image alignment tasks despite having no access to a reward model.

📝 Abstract
The post-training pipeline for diffusion models currently has two stages: supervised fine-tuning (SFT) on curated data and reinforcement learning (RL) with reward models. A fundamental gap separates them. SFT optimizes the denoiser only on ground-truth states sampled from the forward noising process; once inference deviates from these ideal states, subsequent denoising relies on out-of-distribution generalization rather than learned correction, exhibiting the same exposure bias that afflicts autoregressive models, but accumulated along the denoising trajectory instead of the token sequence. RL can in principle address this mismatch, yet its terminal reward signal is sparse, suffers from credit-assignment difficulty, and risks reward hacking. We propose SOAR (Self-Correction for Optimal Alignment and Refinement), a bias-correction post-training method that fills this gap. Starting from a real sample, SOAR performs a single stop-gradient rollout with the current model, re-noises the resulting off-trajectory state, and supervises the model to steer back toward the original clean target. The method is on-policy, reward-free, and provides dense per-timestep supervision with no credit-assignment problem. On SD3.5-Medium, SOAR improves GenEval from 0.70 to 0.78 and OCR from 0.64 to 0.67 over SFT, while simultaneously raising all model-based preference scores. In controlled reward-specific experiments, SOAR surpasses Flow-GRPO in final metric value on both aesthetic and text-image alignment tasks, despite having no access to a reward model. Since SOAR's base loss subsumes the standard SFT objective, it can directly replace SFT as a stronger first post-training stage after pretraining, while remaining fully compatible with subsequent RL alignment.
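
The rollout, re-noise, and regress loop described in the abstract can be illustrated with a toy sketch in pure Python. This is not the authors' implementation: the rectified-flow convention `x_t = (1 - t) * x0 + t * eps`, the re-noising rule in `renoise`, the target `(x_u - x0) / u`, and all function names (`interpolate`, `euler_step`, `renoise`, `corrective_pair`, `mse`) are assumptions chosen for illustration. The rolled-out state is treated as a constant input, which plays the role of the stop-gradient.

```python
import math
import random


def interpolate(x0, eps, t):
    """Forward noising under an assumed rectified-flow convention."""
    return [(1.0 - t) * a + t * e for a, e in zip(x0, eps)]


def euler_step(model, x_t, t, s):
    """One denoising step from time t down to s < t using the predicted velocity."""
    v = model(x_t, t)
    return [x + (s - t) * vi for x, vi in zip(x_t, v)]


def renoise(x_s, s, u, rng):
    """Move a state from time s back up to u >= s (assumed scheme, requires s < 1).

    scale and sigma are chosen so that an on-trajectory x_s maps to a state
    with the same signal and noise levels as a forward-noised x_u.
    """
    scale = (1.0 - u) / (1.0 - s)
    sigma = math.sqrt(max(u * u - (scale * s) * (scale * s), 0.0))
    return [scale * x + sigma * rng.gauss(0.0, 1.0) for x in x_s]


def corrective_pair(model, x0, t, s, u, rng):
    """Build one (input state, time, target velocity) training example."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = interpolate(x0, eps, t)           # ideal state from a real sample
    x_s_hat = euler_step(model, x_t, t, s)  # model's own rollout, held constant
    x_u = renoise(x_s_hat, s, u, rng)       # off-trajectory state, re-noised
    # Assumed target: the velocity that carries x_u back to x0 in time u.
    # For an on-trajectory x_u this reduces to eps - x0, the usual SFT target.
    target = [(x - a) / u for x, a in zip(x_u, x0)]
    return x_u, u, target


def mse(pred, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)


def untrained_model(x, t):
    """Hypothetical stand-in for the denoiser: always predicts zero velocity."""
    return [0.0 for _ in x]


rng = random.Random(0)
x0 = [0.5, -1.0, 2.0]
x_u, u, target = corrective_pair(untrained_model, x0, t=0.8, s=0.6, u=0.7, rng=rng)
loss = mse(untrained_model(x_u, u), target)
```

In a real training loop the loss would be backpropagated through the model's prediction at `(x_u, u)` only, and supervision is available at every sampled timestep, so no terminal reward needs to be attributed to individual steps.
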
Problem

Research questions and friction points this paper is trying to address.

exposure bias
diffusion models
post-training
out-of-distribution generalization
credit assignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

self-correction
diffusion models
post-training
exposure bias
on-policy learning