🤖 AI Summary
Text-to-image generation suffers from poor subject consistency, and existing fine-tuning methods struggle to balance image quality against computational cost.
Method: This paper proposes a subject-driven generation framework based on dual diffusion models. It freezes a large pre-trained diffusion model to preserve its rich semantic priors, while lightly fine-tuning only a compact subject-specific model. Feature fusion and attention modulation mechanisms enable collaborative inference: the large model provides contextual semantic guidance, while the small model specializes in subject representation. Crucially, no full-model fine-tuning is required.
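The collaborative-inference idea can be illustrated with a toy sketch. The paper's actual fusion and attention-modulation mechanisms are not detailed in this summary, so the code below simply blends the noise predictions of a frozen "large" model and a fine-tuned "small" model at each reverse-diffusion step; the stub models, the blending weight `guidance_weight`, and the simplified update rule are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def large_model_eps(x_t, t):
    # Stand-in for the frozen pre-trained model (rich semantic priors).
    # A real model would predict noise from x_t, t, and the text prompt.
    return 0.9 * x_t  # placeholder prediction

def small_model_eps(x_t, t):
    # Stand-in for the lightweight, subject-fine-tuned model.
    return 1.1 * x_t  # placeholder prediction

def fused_step(x_t, t, guidance_weight=0.5):
    """One toy reverse step: fuse the two models' noise estimates.

    guidance_weight controls how much contextual guidance the frozen
    large model contributes relative to the subject-specific model.
    """
    eps = ((1 - guidance_weight) * small_model_eps(x_t, t)
           + guidance_weight * large_model_eps(x_t, t))
    # Simplified update (not a real DDPM/DDIM noise schedule).
    return x_t - 0.1 * eps

# Run a short toy sampling loop on a random "latent".
x = rng.standard_normal((4, 4))
for t in reversed(range(10)):
    x = fused_step(x, t)
```

Because the large model stays frozen, only the small model's weights change during fine-tuning, which is what keeps the per-subject training cost low.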
Results: The method generates high-fidelity, diverse, and cross-scene consistent renditions of the subject in under a minute. It significantly outperforms state-of-the-art approaches on multiple benchmarks, improving subject consistency and image quality simultaneously while drastically reducing computational overhead.
📝 Abstract
Recent text-to-image diffusion models can generate striking visuals from text prompts, but they often fail to maintain subject consistency across generations and contexts. One major limitation of current fine-tuning approaches is the inherent trade-off between quality and efficiency. Fine-tuning large models improves fidelity but is computationally expensive, while fine-tuning lightweight models improves efficiency but compromises image fidelity. Moreover, fine-tuning pre-trained models on a small set of images of the subject can damage the existing priors, resulting in suboptimal results. To this end, we present Stencil, a novel framework that jointly employs two diffusion models during inference. Stencil efficiently fine-tunes a lightweight model on images of the subject, while a large frozen pre-trained model provides contextual guidance during inference, injecting rich priors to enhance generation with minimal overhead. Stencil excels at generating high-fidelity, novel renditions of the subject in less than a minute, delivering state-of-the-art performance and setting a new benchmark in subject-driven generation.