🤖 AI Summary
To address generative distortion in subject-driven fine-tuning caused by heterogeneous training-image quality, this paper proposes QZLoRA, a framework that leverages a vision-language model (VLM) as an "examiner." Using the QuizRank method, the VLM automatically assesses and ranks candidate images, treated as "educational interventions," by their representational fidelity to the target concept (e.g., a female Mountain Bluebird). This ranking guides low-rank adaptation (LoRA)-based parameter-efficient fine-tuning without manual annotation or auxiliary training, and it significantly improves semantic alignment and concept fidelity. Experiments demonstrate that, with fewer training samples, QZLoRA outperforms baselines on both photorealistic and stylized generation tasks, achieving substantial gains in FID, CLIP-Score, and human evaluation.
📝 Abstract
A challenge in fine-tuning text-to-image diffusion models for specific topics is to select good examples. Fine-tuning from image sets of varying quality, such as Wikimedia Commons, will often produce poor output. However, training images that *do* exemplify the target concept (e.g., a *female Mountain Bluebird*) help ensure that the generated images are similarly representative (e.g., have the prototypical blue wings and gray chest). In this work, we propose QZLoRA, a framework to select images for low-rank adaptation (LoRA). The approach leverages QuizRank, a method to automatically rank images by treating them as an "educational intervention" and "quizzing" a VLM. We demonstrate that QZLoRA can produce better-aligned, photorealistic images with fewer samples. We also show that these fine-tuned models can produce stylized images (e.g., illustrations) that are similarly representative. Our results highlight the promise of combining automated visual reasoning with parameter-efficient fine-tuning for topic-adaptive generative modeling.
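The selection step described above can be sketched minimally: score each candidate image by "quizzing" a VLM on concept attributes, then keep the top-ranked images for LoRA fine-tuning. The function names, question format, and `vlm_answer` callable below are illustrative assumptions, not the paper's actual API.

```python
# Hypothetical sketch of QuizRank-style image selection for LoRA fine-tuning.
# `vlm_answer(image, question)` stands in for a real VLM query; all names here
# are assumptions for illustration only.

def quiz_score(image, quiz, vlm_answer):
    """Fraction of concept-attribute quiz questions the VLM answers as expected
    (e.g., ("Does the bird have blue wings?", "yes") for a female Mountain Bluebird)."""
    return sum(vlm_answer(image, q) == expected for q, expected in quiz) / len(quiz)

def select_for_lora(images, quiz, vlm_answer, k):
    """Rank candidate images by quiz score and keep the top k as the
    fine-tuning set for low-rank adaptation."""
    ranked = sorted(images, key=lambda im: quiz_score(im, quiz, vlm_answer),
                    reverse=True)
    return ranked[:k]
```

In practice the ranking quality rests entirely on the quiz: questions should probe the attributes that make an image representative of the concept, so that low-scoring (unrepresentative) images are excluded from training.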