🤖 AI Summary
This work addresses the challenge that existing layout generation models struggle to capture fine-grained human aesthetic preferences in graphic design, and that general-purpose image-text preference data and reward models transfer poorly to layout evaluation. To this end, the authors construct DesignSense-10k, the first human preference dataset for graphic layouts, comprising 10,235 expert-annotated pairwise comparisons, and introduce a five-stage layout transformation and filtering pipeline. The pipeline combines semantic grouping, layout prediction, clustering, and vision-language model (VLM) refinement to generate high-quality contrastive samples, which are then used to train a specialized VLM-based preference classifier. The resulting model improves Macro F1 by 54.6% over the strongest closed-source baseline; used as a reward signal in reinforcement learning, it raises generator win rates by about 3%, and an inference-time selection strategy yields a further 3.6% quality gain.
📝 Abstract
Graphic layouts serve as an important and engaging medium for visual communication across different channels. While recent layout generation models have demonstrated impressive capabilities, they frequently fail to align with nuanced human aesthetic judgment. Existing preference datasets and reward models trained on text-to-image generation do not generalize to layout evaluation, where the spatial arrangement of identical elements determines quality. To address this critical gap, we introduce DesignSense-10k, a large-scale dataset of 10,235 human-annotated preference pairs for graphic layout evaluation. We propose a five-stage curation pipeline that generates visually coherent layout transformations across diverse aspect ratios, using semantic grouping, layout prediction, filtering, clustering, and VLM-based refinement to produce high-quality comparison pairs. Human preferences are annotated with a 4-class scheme (left, right, both good, both bad) to capture subjective ambiguity. Leveraging this dataset, we train DesignSense, a vision-language model-based classifier that substantially outperforms existing open-source and proprietary models across comprehensive evaluation metrics (a 54.6% improvement in Macro F1 over the strongest proprietary baseline). Our analysis shows that frontier VLMs remain unreliable overall and fail catastrophically on the full four-class task, underscoring the need for specialized, preference-aware models. Beyond the dataset, our reward model DesignSense yields tangible downstream gains in layout generation. Using our judge during RL-based training improves the generator win rate by about 3%, while inference-time scaling, which generates multiple candidates and selects the best one, provides a further 3.6% improvement. These results highlight the practical impact of specialized, layout-aware preference modeling on real-world layout generation quality.
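The inference-time scaling described above can be sketched as a best-of-N tournament: generate several candidate layouts, compare them pairwise with the judge, and keep the candidate with the most wins. This is a minimal illustration, not the paper's implementation; `judge` and the string-typed candidates are placeholders, and the verdict labels mirror the paper's 4-class annotation scheme.

```python
# Hypothetical sketch of best-of-N candidate selection with a pairwise
# preference judge. `judge(a, b)` is assumed to return one of:
# "left", "right", "both_good", "both_bad" (the paper's 4-class scheme).
from typing import Callable, List


def best_of_n(candidates: List[str],
              judge: Callable[[str, str], str]) -> str:
    """Return the candidate with the most pairwise wins."""
    wins = [0] * len(candidates)
    for i in range(len(candidates)):
        for j in range(i + 1, len(candidates)):
            verdict = judge(candidates[i], candidates[j])
            if verdict == "left":
                wins[i] += 1
            elif verdict == "right":
                wins[j] += 1
            elif verdict == "both_good":
                wins[i] += 1
                wins[j] += 1
            # "both_bad": no wins awarded to either candidate
    best = max(range(len(candidates)), key=lambda k: wins[k])
    return candidates[best]
```

With N candidates this makes N(N-1)/2 judge calls; a scalar reward head would instead score each candidate once and take the argmax, trading comparison fidelity for fewer model calls.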