🤖 AI Summary
This work addresses key limitations of existing visual autoregressive (VAR) models in subject-driven image generation, namely the train-test inconsistency of multi-scale conditioning and insufficient semantic alignment. To resolve these issues, the authors propose a pre-filled subject feature sequence mechanism that extracts reference subject features with a multi-scale visual tokenizer and injects them in full before autoregressive generation begins, eliminating the train-test discrepancy and simplifying dependency modeling. The study also introduces reinforcement learning into the VAR framework for the first time, jointly optimizing semantic alignment and subject consistency. The proposed approach significantly enhances appearance fidelity and achieves generation quality superior to state-of-the-art diffusion models.
📝 Abstract
Recent advances in subject-driven image generation using diffusion models have attracted considerable attention for their remarkable capabilities in producing high-quality images. Nevertheless, the potential of Visual Autoregressive (VAR) models, despite their unified architecture and efficient inference, remains underexplored. In this work, we present DreamVAR, a novel framework for subject-driven image synthesis built upon a VAR model that employs next-scale prediction. Technically, multi-scale features of the reference subject are first extracted by a visual tokenizer. Instead of interleaving these conditional features with target image tokens across scales, DreamVAR pre-fills the full subject feature sequence before predicting target image tokens. This design simplifies autoregressive dependencies and mitigates the train-test discrepancy that arises in multi-scale conditioning within the VAR paradigm. DreamVAR further incorporates reinforcement learning to jointly enhance semantic alignment and subject consistency. Extensive experiments demonstrate that DreamVAR achieves superior appearance preservation compared to leading diffusion-based methods.
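To make the pre-fill idea concrete, the toy sketch below contrasts an interleaved layout (condition tokens alternating with target tokens at each scale) with a pre-filled layout in which every subject token precedes every target token. All names here (`SCALES`, `subject_tokens`, the string token labels) are illustrative assumptions for exposition, not the paper's actual tokenizer or sequence format.

```python
SCALES = [1, 2, 4]  # toy next-scale resolutions (tokens per side at each scale)

def subject_tokens(scales):
    # Stand-in for the multi-scale visual tokenizer: one token list per scale,
    # with k*k tokens at a scale of k tokens per side.
    return [[f"s{k}_{i}" for i in range(k * k)] for k in scales]

def interleaved_layout(scales):
    # Baseline layout: subject (condition) tokens alternate with target tokens
    # scale by scale, so conditioning is spread through the sequence.
    seq = []
    for subj, k in zip(subject_tokens(scales), scales):
        seq += subj                                  # condition tokens for scale k
        seq += [f"t{k}_{i}" for i in range(k * k)]   # target tokens for scale k
    return seq

def prefilled_layout(scales):
    # Pre-filled layout: the full subject feature sequence is injected first,
    # and only then are target tokens predicted autoregressively.
    seq = [tok for subj in subject_tokens(scales) for tok in subj]
    for k in scales:
        seq += [f"t{k}_{i}" for i in range(k * k)]
    return seq
```

With the pre-filled layout, the conditional prefix the model sees at inference time is exactly the prefix it saw during training, which is the intuition behind removing the train-test discrepancy.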