🤖 AI Summary
Generative models trained via likelihood or reconstruction losses often fail to ensure perceptual quality, semantic fidelity, and physical plausibility. Method: This work systematically surveys the structured application of reinforcement learning (RL) to visual generation, framing RL as a general optimization paradigm for high-dimensional generative tasks that directly optimizes non-differentiable, multi-objective, and temporally structured perceptual criteria. By integrating RL components—reward modeling, policy gradient methods, and human feedback—with mainstream generative frameworks (e.g., diffusion models and GANs), the surveyed approaches enhance controllability, coherence, and realism across images, videos, and 3D/4D content. Contribution/Results: The survey establishes a unified methodology for RL-driven visual generation and documents improved alignment with human preferences in multimodal synthesis. It further identifies key future directions, including cross-modal alignment, embodied simulation, and interpretable reward design.
📝 Abstract
Generative models have made significant progress in synthesizing visual content, including images, videos, and 3D/4D structures. However, they are typically trained with surrogate objectives such as likelihood or reconstruction loss, which often misalign with perceptual quality, semantic accuracy, or physical realism. Reinforcement learning (RL) offers a principled framework for optimizing non-differentiable, preference-driven, and temporally structured objectives. Recent advances demonstrate its effectiveness in enhancing controllability, consistency, and human alignment across generative tasks. This survey provides a systematic overview of RL-based methods for visual content generation. We review the evolution of RL from classical control to its role as a general-purpose optimization tool, and examine its integration into image, video, and 3D/4D generation. Across these domains, RL serves not only as a fine-tuning mechanism but also as a structural component for aligning generation with complex, high-level goals. We conclude with open challenges and future research directions at the intersection of RL and generative modeling.
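To make the abstract's central point concrete—that policy gradients can optimize a reward a likelihood loss cannot differentiate through—here is a minimal, self-contained sketch. It is an illustration of the general technique (REINFORCE with a baseline), not a method from the survey: the toy "generator" is a 1-D Gaussian policy, and the non-differentiable reward stands in for a perceptual or preference score.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "generator": a 1-D Gaussian policy with learnable mean and log-std.
mu, log_sigma = 0.0, 0.0

def reward(x):
    # Non-differentiable stand-in for a perceptual/preference score:
    # higher reward the closer a sample lies to the target value 3.0.
    return -np.abs(x - 3.0)

lr = 0.05
for step in range(2000):
    sigma = np.exp(log_sigma)
    x = rng.normal(mu, sigma, size=64)   # a batch of "generations"
    r = reward(x)
    adv = r - r.mean()                   # baseline reduces gradient variance

    # REINFORCE: grad of log N(x; mu, sigma) weighted by the advantage.
    grad_mu = np.mean(adv * (x - mu) / sigma**2)
    grad_log_sigma = np.mean(adv * ((x - mu) ** 2 / sigma**2 - 1.0))

    mu += lr * grad_mu
    log_sigma += lr * grad_log_sigma

print(f"mu = {mu:.2f}")  # mu moves toward the high-reward region near 3.0
```

The same score-function trick underlies RLHF-style fine-tuning of diffusion models, where the "policy" is the denoising process and the reward comes from a learned human-preference model; only the scale and parameterization differ.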