🤖 AI Summary
Existing diffusion models struggle to perform fine-grained, continuous, and intensity-controllable editing of image aesthetic attributes (e.g., “brightness”, “refinement”) because they rely on ambiguous text prompts or costly human preference annotations, which limits scalability.
Method: We propose a plug-and-play aesthetic control framework that leverages a pretrained vision-language model to quantify semantic similarity of abstract aesthetics, and introduces a lightweight value encoder that maps [0,1] intensity scalars into differentiable embeddings—seamlessly integrated into text-conditioned diffusion sampling.
Contribution/Results: Our method requires no human preference labels, enables independent or joint control over multiple attributes, supports continuous cross-intensity editing, and is compatible with mainstream open-source generators (e.g., Stable Diffusion). Experiments demonstrate significant improvements over baselines in both single-attribute fidelity and multi-attribute coordination, achieving high practicality, flexibility, and scalability.
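The value-encoder idea above can be sketched in a few lines. This is a minimal, hypothetical illustration (the class name `ValueEncoder`, the network sizes, and the concatenation scheme are assumptions, not the paper's actual implementation): a tiny MLP maps a scalar intensity in [0, 1] to an embedding of the text encoder's width, which is then appended to the prompt embeddings that condition diffusion sampling.

```python
import numpy as np

class ValueEncoder:
    """Hypothetical sketch of a lightweight value encoder: maps a scalar
    intensity in [0, 1] to an embedding whose width matches the text
    encoder's hidden size, so it can be appended to the prompt embeddings
    used as diffusion conditioning. Weights here are random stand-ins;
    in the actual method they would be learned."""

    def __init__(self, hidden_dim=64, embed_dim=768, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.standard_normal((1, hidden_dim)) * 0.02
        self.b1 = np.zeros(hidden_dim)
        self.w2 = rng.standard_normal((hidden_dim, embed_dim)) * 0.02
        self.b2 = np.zeros(embed_dim)

    def __call__(self, intensity: float) -> np.ndarray:
        x = np.array([[intensity]])               # scalar -> shape (1, 1)
        h = np.tanh(x @ self.w1 + self.b1)        # (1, hidden_dim)
        return (h @ self.w2 + self.b2)[0]         # (embed_dim,)

# Append one intensity embedding per controlled attribute to the text
# conditioning (shapes mimic a CLIP-style 77-token prompt encoding):
encoder = ValueEncoder()
prompt_embeds = np.zeros((77, 768))               # placeholder prompt embeddings
tokens = [encoder(v) for v in (0.2, 0.8)]         # e.g. two attribute intensities
cond = np.vstack([prompt_embeds] + [t[None, :] for t in tokens])  # (79, 768)
```

Because each attribute contributes its own embedding token, attributes can be controlled independently or jointly simply by varying which scalars are encoded and appended.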
📝 Abstract
Recent breakthroughs in text-to-image diffusion models have significantly enhanced both the visual fidelity and semantic controllability of generated images. However, fine-grained control over aesthetic attributes remains challenging, especially when users require continuous and intensity-specific adjustments. Existing approaches often rely on vague textual prompts, which are inherently ambiguous in expressing both the aesthetic semantics and the desired intensity, or depend on costly human preference data for alignment, limiting their scalability and practicality. To address these limitations, we propose AttriCtrl, a plug-and-play framework for precise and continuous control of aesthetic attributes. Specifically, we quantify abstract aesthetics by leveraging semantic similarity from pre-trained vision-language models, and employ a lightweight value encoder that maps scalar intensities in $[0,1]$ to learnable embeddings within diffusion-based generation. This design enables intuitive and customizable aesthetic manipulation, with minimal training overhead and seamless integration into existing generation pipelines. Extensive experiments demonstrate that AttriCtrl achieves accurate control over individual attributes as well as flexible multi-attribute composition. Moreover, it is fully compatible with popular open-source controllable generation frameworks, showcasing strong integration capability and practical utility across diverse generation scenarios.
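The abstract's step of "quantifying abstract aesthetics by leveraging semantic similarity from pre-trained vision-language models" can be illustrated with a small sketch. Everything here is assumed for illustration (the function names, the min-max normalization, and the random stand-in embeddings); the idea is that cosine similarity between image embeddings and an attribute's text embedding, rescaled to [0, 1], yields label-free intensity scores usable as supervision.

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def intensity_scores(image_embeds: np.ndarray, attribute_embed: np.ndarray) -> np.ndarray:
    """Hypothetical scoring: similarity of each image embedding to an
    attribute text embedding (e.g. the encoding of "a bright photo"),
    min-max normalized over the image set to give [0, 1] intensities."""
    sims = np.array([cosine_sim(e, attribute_embed) for e in image_embeds])
    lo, hi = sims.min(), sims.max()
    return (sims - lo) / (hi - lo + 1e-8)

# Stand-ins for VLM embeddings (a real pipeline would use e.g. CLIP):
rng = np.random.default_rng(0)
images = rng.standard_normal((5, 512))   # image embeddings
attr = rng.standard_normal(512)          # text embedding of an aesthetic prompt
scores = intensity_scores(images, attr)  # one score in [0, 1] per image
```

Scores produced this way require no human preference annotations, which is the scalability advantage the abstract emphasizes over alignment-based approaches.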