Match & Choose: Model Selection Framework for Fine-tuning Text-to-Image Diffusion Models

📅 2025-08-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Selecting the optimal pre-trained text-to-image (T2I) diffusion model for fine-tuning remains an open challenge. Method: This paper proposes M&C, the first systematic model selection framework, which constructs a model–dataset matching graph to explicitly model performance correlations between models and datasets, and employs a graph neural network to jointly encode model architectures, dataset characteristics, and graph topology for predicting relative fine-tuning performance in target domains. Contribution/Results: M&C eliminates costly exhaustive fine-tuning. Experiments across 10 T2I models and 32 datasets show that M&C identifies the top-performing model in 61.3% of cases and consistently recommends high-performing alternatives otherwise—substantially improving both efficiency and accuracy of model selection. To our knowledge, this is the first work to introduce graph-structured modeling into T2I model selection, establishing a novel paradigm for efficient large-model adaptation.

📝 Abstract
Text-to-image (T2I) models based on diffusion and transformer architectures are advancing rapidly. They are often pretrained on large corpora and openly shared on model platforms such as HuggingFace. Users can then build AI applications, e.g., for generating media content, by adopting a pretrained T2I model and fine-tuning it on a target dataset. While public pretrained T2I models democratize access, users face a new challenge: which model will fine-tune best on the target data domain? Model selection is well studied for classification tasks, but little is known about pretrained T2I models and how to predict their post-fine-tuning performance on a target domain. In this paper, we propose the first model selection framework for this setting, M&C, which enables users to efficiently choose a pretrained T2I model from a model platform without exhaustively fine-tuning every candidate on the target dataset. The core of M&C is a matching graph, which consists of: (i) nodes representing available models and profiled datasets, and (ii) edges representing model-data and data-data pairs, capturing fine-tuning performance and data similarity, respectively. We then build a predictor that, given model/data features and, critically, graph embedding features extracted from the matching graph, predicts which model will achieve the best quality after fine-tuning on the target domain. We evaluate M&C on selecting among ten T2I models across 32 datasets against three baselines. Our results show that M&C predicts the best model for fine-tuning in 61.3% of cases and a closely performing model in the rest.
Problem

Research questions and friction points this paper is trying to address.

Selecting the best pretrained text-to-image model for fine-tuning
Predicting a model's fine-tuning performance on a target data domain
Avoiding exhaustive fine-tuning of every available model
Innovation

Methods, ideas, or system contributions that make the work stand out.

Model selection framework (M&C) for T2I fine-tuning
Matching graph with model-data and data-data edges
Graph neural network predicts the best model from graph embeddings
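The matching-graph idea above can be sketched in a few lines. This is a minimal illustration only, not the paper's implementation: the feature vectors, edge weights, single unweighted aggregation step, and the `embed`/`recommend` helpers are all hypothetical stand-ins (the paper's GNN learns its aggregation weights and is trained on observed fine-tuning outcomes).

```python
import numpy as np

# Hypothetical matching graph: 3 candidate T2I models, 2 profiled datasets.
# All features and edge weights below are illustrative stand-ins.
rng = np.random.default_rng(0)
n_models, n_datasets, dim = 3, 2, 4
model_feats = rng.normal(size=(n_models, dim))   # model architecture features
data_feats = rng.normal(size=(n_datasets, dim))  # dataset profile features

# Edge weights of the matching graph:
#   model-data edges = observed fine-tuning quality on profiled datasets,
#   data-data edges  = pairwise dataset similarity.
model_data = np.array([[0.9, 0.2],
                       [0.4, 0.8],
                       [0.6, 0.5]])
data_data = np.array([[1.0, 0.3],
                      [0.3, 1.0]])

def embed(model_feats, data_feats, model_data, data_data):
    """One round of weighted neighbourhood aggregation over the
    heterogeneous graph (a GNN layer stripped of learned weights)."""
    m_emb = model_feats + model_data @ data_feats  # models pull from datasets
    d_emb = data_feats + data_data @ data_feats + model_data.T @ model_feats
    return m_emb, d_emb

def recommend(target_feat, m_emb, d_emb, data_feats):
    """Rank models for an unseen target dataset (best first)."""
    # Project the target into the graph via similarity to profiled datasets.
    sim = data_feats @ target_feat          # (n_datasets,)
    target_emb = target_feat + sim @ d_emb  # similarity-weighted pooling
    scores = m_emb @ target_emb             # one score per model
    return np.argsort(scores)[::-1]

m_emb, d_emb = embed(model_feats, data_feats, model_data, data_data)
ranking = recommend(rng.normal(size=dim), m_emb, d_emb, data_feats)
print(ranking)  # indices of the candidate models, best predicted first
```

The point of the sketch is the structure: models are scored for a target domain they were never fine-tuned on, purely from how they sit in the graph relative to datasets similar to the target, which is what lets M&C skip exhaustive fine-tuning.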