🤖 AI Summary
To address the high fine-tuning cost and slow inference of subject-driven personalization in text-to-image diffusion models, this paper proposes a fine-tuning-free hypernetwork framework. The method trains a hypernetwork end to end to predict LoRA adapter weights directly from a single subject image, removing per-subject optimization at test time and avoiding the costly synthetic-data generation and unstable trajectory-mimicking objectives of prior hypernetwork approaches. A simple output regularization stabilizes the end-to-end training objective, yielding reliable and effective hypernetworks. To improve compositional generalization at inference time, the paper introduces Hybrid-Model Classifier-Free Guidance (HM-CFG), which combines guidance signals from the base diffusion model and the subject-personalized model during sampling. Extensive experiments on CelebA-HQ, AFHQ-v2, and DreamBench demonstrate high-fidelity subject reconstruction and precise text alignment, supporting hypernetworks as a scalable and effective direction for open-category personalization.
📝 Abstract
Personalizing text-to-image diffusion models has traditionally relied on subject-specific fine-tuning approaches such as DreamBooth~\cite{ruiz2023dreambooth}, which are computationally expensive and slow at inference. Recent adapter- and encoder-based methods attempt to reduce this overhead but still depend on additional fine-tuning or large backbone models for satisfactory results. In this work, we revisit an orthogonal direction: fine-tuning-free personalization via hypernetworks that predict LoRA-adapted weights directly from subject images. Prior hypernetwork-based approaches, however, suffer from costly data generation or unstable attempts to mimic base-model optimization trajectories. We address these limitations with an end-to-end training objective, stabilized by a simple output regularization, yielding reliable and effective hypernetworks. Our method removes the need for per-subject optimization at test time while preserving both subject fidelity and prompt alignment. To further enhance compositional generalization at inference time, we introduce Hybrid-Model Classifier-Free Guidance (HM-CFG), which combines the compositional strengths of the base diffusion model with the subject fidelity of personalized models during sampling. Extensive experiments on CelebA-HQ, AFHQ-v2, and DreamBench demonstrate that our approach achieves strong personalization performance and highlight the promise of hypernetworks as a scalable and effective direction for open-category personalization.
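The abstract describes HM-CFG as blending guidance from the base and subject-personalized models during sampling, but does not spell out the update rule. The sketch below shows one plausible form of such a hybrid guidance step, assuming a simple convex blend of the two models' conditional noise predictions before standard classifier-free guidance; the function name `hm_cfg` and the `mix` parameter are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def hm_cfg(eps_uncond, eps_base_cond, eps_subj_cond,
           guidance_scale=7.5, mix=0.5):
    """Hypothetical Hybrid-Model CFG step.

    Blends the conditional noise predictions of the base model
    (compositional strength) and the subject-personalized model
    (subject fidelity), then applies standard classifier-free
    guidance. The convex blend via `mix` is an assumption.
    """
    eps_cond = mix * eps_subj_cond + (1.0 - mix) * eps_base_cond
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# Toy usage with random stand-ins for the models' noise predictions.
rng = np.random.default_rng(0)
e_u, e_b, e_s = (rng.standard_normal((4, 64, 64)) for _ in range(3))
eps = hm_cfg(e_u, e_b, e_s, guidance_scale=7.5, mix=0.5)
```

With `mix=1.0` this reduces to ordinary CFG on the personalized model alone, and with `mix=0.0` to CFG on the base model, so the parameter directly trades subject fidelity against compositional generalization.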