🤖 AI Summary
This work addresses the prevailing neglect of subjective emotional content in existing visual customization methods and the absence of a general-purpose foundation model for affective image generation. To bridge this gap, the paper proposes a novel task, LLM-centric Affective Visual Customization (L-AVC), which for the first time introduces explicit control over subjective emotion into image editing via multimodal large language models. The task mandates semantic-level emotional transformation while preserving emotion-irrelevant visual content. To facilitate research in this direction, the authors construct the first L-AVC dataset and propose the Efficient and Precise Emotion Manipulating (EPEM) approach, built from two key components: an Efficient Inter-emotion Converting (EIC) module for effective affective semantic transfer and a Precise Exter-emotion Retaining (PER) module that maintains the fidelity of non-emotional content. Experiments demonstrate that EPEM significantly outperforms current state-of-the-art methods on the proposed dataset, confirming both the critical role of emotional information in visual customization and the efficacy of the framework.
📝 Abstract
Previous studies on visual customization rely primarily on the objective alignment between various control signals (e.g., language, layout, and Canny edges) and the edited images; they largely ignore subjective emotional content and, more importantly, lack general-purpose foundation models for affective visual customization. With this in mind, this paper proposes an LLM-centric Affective Visual Customization (L-AVC) task, which focuses on editing images by modifying their subjective emotions via a multimodal LLM. Further, this paper contends that two problems are particularly important and challenging in the L-AVC task: how to make the model efficiently align emotion conversion in semantics (named inter-emotion semantic conversion), and how to precisely retain emotion-agnostic content (named exter-emotion semantic retention). To this end, this paper proposes an Efficient and Precise Emotion Manipulating (EPEM) approach for editing subjective emotions in images. Specifically, an Efficient Inter-emotion Converting (EIC) module is tailored to make the LLM efficiently align emotion conversion in semantics before and after editing, followed by a Precise Exter-emotion Retaining (PER) module that precisely retains the emotion-agnostic content. Comprehensive experimental evaluations on our constructed L-AVC dataset demonstrate the clear advantage of the proposed EPEM approach over several state-of-the-art baselines on the L-AVC task. This justifies the importance of emotion information for L-AVC and the effectiveness of EPEM in efficiently and precisely manipulating such information.