Advancing Aesthetic Image Generation via Composition Transfer

📅 2026-05-06
📈 Citations: 0
Influential: 0

📝 Abstract
Composition is a cornerstone of visual aesthetics, influencing the appeal of an image. While its principles operate independently of specific content, in practice composition is often coupled with semantics. As a result, existing methods typically enhance composition either through implicit learning or through semantics-based layout control, rather than by explicitly modeling composition itself. To address this gap, we introduce Composer, a framework rooted in aesthetic theory and designed to model composition in a semantic-agnostic manner. First, it supports composition transfer by extracting key composition-aware representations from a reference image and using a tailored conditional guidance module to control composition on top of pre-trained diffusion models. Second, when users specify only text themes without a composition reference, Composer supports theme-driven composition retrieval by leveraging the in-context learning capabilities of Large Vision-Language Models (LVLMs), achieving explicit composition planning. To enhance composition in a reference-free mode, we conduct text-to-composition fine-tuning on the trained control module to enable implicit composition planning. Furthermore, we curate a high-quality dataset of 2 million image-text pairs using state-of-the-art generative models to support model training. Experimental results demonstrate that Composer significantly enhances aesthetic quality in text-to-image tasks and facilitates personalized composition control and transfer, offering users precision and flexibility in the creative process.
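The abstract describes three ways of obtaining a composition signal: transfer from a reference image, theme-driven retrieval (explicit planning), and a reference-free mode (implicit planning). The sketch below only illustrates how a caller might choose among them; it is not the authors' implementation. All names (`CompositionMode`, `Request`, `select_mode`, `use_retrieval`) are hypothetical, and the rule for choosing between retrieval and the reference-free mode is an assumption, since the abstract does not state one.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class CompositionMode(Enum):
    """The three modes named in the abstract (enum names are hypothetical)."""
    TRANSFER = "composition transfer from a reference image"
    RETRIEVAL = "theme-driven composition retrieval (explicit planning)"
    REFERENCE_FREE = "text-to-composition control (implicit planning)"


@dataclass
class Request:
    prompt: str
    # Path to a composition reference image, if the user supplied one.
    reference_image: Optional[str] = None
    # Hypothetical switch: the abstract does not say how the two
    # text-only modes are chosen, so the caller decides here.
    use_retrieval: bool = True


def select_mode(request: Request) -> CompositionMode:
    """Pick a composition mode for a request (illustrative rule only)."""
    if request.reference_image is not None:
        return CompositionMode.TRANSFER
    if request.use_retrieval:
        return CompositionMode.RETRIEVAL
    return CompositionMode.REFERENCE_FREE
```

For example, `select_mode(Request("a lighthouse at dusk", reference_image="ref.png"))` selects the transfer mode, while a text-only request falls through to one of the two planning modes.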
Problem

Research questions and friction points this paper is trying to address.

composition
aesthetic image generation
semantic-agnostic
composition transfer
diffusion models
Innovation

Methods, ideas, or system contributions that make the work stand out.

composition transfer
semantic-agnostic modeling
diffusion models
Large Vision-Language Models (LVLMs)
aesthetic image generation