🤖 AI Summary
This work addresses the training collapse problem induced by Recursive Self-Improvement (RSI) in text-to-image diffusion models, attributing it to two root causes: insufficient perceptual alignment and cumulative hallucination. We propose the first RSI framework specifically designed for diffusion models, featuring three core innovations: (1) perceptual alignment prompting to enhance cross-modal semantic consistency; (2) a human-preference-driven hallucination filtering mechanism; and (3) a distribution-shift-aware sample weighting strategy. Our end-to-end pipeline integrates prompt engineering, automated hallucination filtering, and weighted retraining to effectively mitigate model degradation. Experiments demonstrate that after multiple rounds of self-generation and retraining, image fidelity and text–image alignment remain stable; human preference scores improve by 23.6%, and the hallucination rate decreases by 41.2%.
📝 Abstract
Recursive Self-Improvement (RSI) enables intelligence systems to autonomously refine their capabilities. This paper explores the application of RSI in text-to-image diffusion models, addressing the challenge of training collapse caused by synthetic data. We identify two key factors contributing to this collapse: the lack of perceptual alignment and the accumulation of generative hallucinations. To mitigate these issues, we propose three strategies: (1) a prompt construction and filtering pipeline designed to facilitate the generation of perceptual aligned data, (2) a preference sampling method to identify human-preferred samples and filter out generative hallucinations, and (3) a distribution-based weighting scheme to penalize selected samples with hallucinatory errors. Our extensive experiments validate the effectiveness of these approaches.