🤖 AI Summary
To address generative distortion in subject-driven fine-tuning caused by heterogeneous training-image quality, this paper proposes QZLoRA, a framework that leverages a vision-language model (VLM) as an "examiner." Using the QuizRank method, the VLM automatically assesses and ranks candidate images, treated as "educational interventions," by their representational fidelity to the target concept (e.g., a female Mountain Bluebird). This ranking guides low-rank adaptation (LoRA)-based parameter-efficient fine-tuning without manual annotation or auxiliary training, and it significantly improves semantic alignment and concept fidelity. Experiments demonstrate that, with fewer training samples, QZLoRA outperforms baselines on both photorealistic and stylized generation tasks, achieving substantial gains in FID, CLIP-Score, and human evaluation.
📝 Abstract
A challenge in fine-tuning text-to-image diffusion models for specific topics is to select good examples. Fine-tuning from image sets of varying quality, such as Wikimedia Commons, will often produce poor output. However, training images that *do* exemplify the target concept (e.g., a *female Mountain Bluebird*) help ensure that the generated images are similarly representative (e.g., have the prototypical blue wings and gray chest). In this work, we propose QZLoRA, a framework to select images for low-rank adaptation (LoRA). The approach leverages QuizRank, a method to automatically rank images by treating them as an "educational intervention" and "quizzing" a VLM. We demonstrate that QZLoRA can produce better-aligned, photorealistic images with fewer samples. We also show that these fine-tuned models can produce stylized images (e.g., illustrations) that are similarly representative. Our results highlight the promise of combining automated visual reasoning with parameter-efficient fine-tuning for topic-adaptive generative modeling.
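The selection step described above can be sketched minimally: score each candidate image by "quizzing" a VLM on concept attributes, then keep the top-ranked images for LoRA fine-tuning. The function names, question format, and `vlm_answer` callable below are illustrative assumptions, not the paper's actual API.

```python
# Hypothetical sketch of QuizRank-style image selection for LoRA fine-tuning.
# `vlm_answer(image, question)` stands in for a real VLM query; all names here
# are assumptions for illustration only.

def quiz_score(image, quiz, vlm_answer):
    """Fraction of concept-attribute quiz questions the VLM answers as expected
    (e.g., ("Does the bird have blue wings?", "yes") for a female Mountain Bluebird)."""
    return sum(vlm_answer(image, q) == expected for q, expected in quiz) / len(quiz)

def select_for_lora(images, quiz, vlm_answer, k):
    """Rank candidate images by quiz score and keep the top k as the
    fine-tuning set for low-rank adaptation."""
    ranked = sorted(images, key=lambda im: quiz_score(im, quiz, vlm_answer),
                    reverse=True)
    return ranked[:k]
```

In practice the ranking quality rests entirely on the quiz: questions should probe the attributes that make an image representative of the concept, so that low-scoring (unrepresentative) images are excluded from training.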