🤖 AI Summary
Visual design preferences are highly subjective and vary substantially across individuals, yet existing generative models rely on aggregated crowd-sourced annotations, limiting their ability to capture personalized aesthetic requirements. Method: We introduce DesignPref, the first multi-level pairwise preference dataset for UI design, comprising 12,000 comparisons annotated by 20 professional designers. Inter-annotator disagreement is substantial (Krippendorff's alpha = 0.25 for binary preferences), and analysis of the designers' natural-language rationales shows that discrepancies stem from divergent weighting of design dimensions and from idiosyncratic aesthetic criteria. We investigate personalization strategies, in particular fine-tuning on designer-specific annotations and incorporating them into RAG pipelines. Contribution/Results: Personalized models predict individual designers' preferences more accurately than conventional majority-voting baselines while using 20 times fewer training examples, establishing a foundation for personalized visual design evaluation.
📝 Abstract
Generative models, such as large language models and text-to-image diffusion models, are increasingly used to create visual designs like user interfaces (UIs) and presentation slides. Fine-tuning and benchmarking these generative models have often relied on datasets of human-annotated design preferences. Yet, due to the subjective and highly personalized nature of visual design, preferences vary widely among individuals. In this paper, we study this problem by introducing DesignPref, a dataset of 12k pairwise comparisons of generated UI designs, annotated by 20 professional designers with multi-level preference ratings. We found that substantial disagreement exists even among trained designers (Krippendorff's alpha = 0.25 for binary preferences). Natural language rationales provided by these designers indicate that disagreements stem from differing perceptions of the importance of various design aspects and from individual preferences. With DesignPref, we demonstrate that traditional majority-voting methods for training aggregated judge models often do not accurately reflect individual preferences. To address this challenge, we investigate multiple personalization strategies, particularly fine-tuning on designer-specific annotations or incorporating them into RAG pipelines. Our results show that personalized models consistently outperform aggregated baseline models in predicting individual designers' preferences, even when using 20 times fewer examples. Our work provides the first dataset for studying personalized visual design evaluation and supports future research into modeling individual design taste.
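The agreement statistic reported above (Krippendorff's alpha = 0.25 for binary preferences) measures observed versus chance-expected disagreement over items that may be rated by different subsets of annotators. As a rough illustration (not the authors' code), a minimal sketch of the coefficient for nominal labels, computed via the standard coincidence-matrix formulation, might look like:

```python
from collections import Counter

def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data.

    `units` is a list of lists: each inner list holds the labels given
    to one item by whichever raters annotated it (missing ratings are
    simply omitted). Returns 1.0 for perfect agreement, ~0 for chance.
    """
    # Build the coincidence counts o[(c, k)]: every ordered pair of
    # labels from different raters within a unit contributes 1/(m-1).
    o = Counter()
    for values in units:
        m = len(values)
        if m < 2:
            continue  # units with fewer than 2 ratings carry no pairing info
        for i, c in enumerate(values):
            for j, k in enumerate(values):
                if i != j:
                    o[(c, k)] += 1.0 / (m - 1)

    # Marginal label frequencies n_c and grand total n.
    n_c = Counter()
    for (c, _k), w in o.items():
        n_c[c] += w
    n = sum(n_c.values())

    # Observed vs. chance-expected disagreement (nominal delta: 1 if c != k).
    d_o = sum(w for (c, k), w in o.items() if c != k)
    d_e = sum(n_c[c] * n_c[k] for c in n_c for k in n_c if c != k) / (n - 1)
    return 1.0 - d_o / d_e
```

For example, `krippendorff_alpha_nominal([[0, 0], [1, 1]])` returns 1.0 (perfect agreement), while mixing in a split item such as `[0, 1]` pulls the value down toward chance; a dataset-wide value of 0.25 indicates agreement only modestly above chance.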