Emotion-Director: Bridging Affective Shortcut in Emotion-Oriented Image Generation

📅 2025-12-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing affective image generation methods suffer from the “affective shortcut” problem—equating emotion solely with semantic descriptions—yielding outputs lacking authentic emotional expression. To address this, we propose Emotion-Director, a cross-modal collaborative framework comprising MC-Diffusion (a diffusion-based generator) and MC-Agent (a prompt rewriting system). Our core innovation is emotion-visual disentanglement modeling: (i) disentangling emotion from semantics via DPO optimization augmented with negative visual prompts; and (ii) generating subjective, non-semantic emotion prompts through multi-agent chained conceptual reasoning. The method integrates diffusion modeling, cross-modal contrastive learning, and chained conceptual prompting. Extensive experiments on multiple affective benchmarks demonstrate significant improvements over state-of-the-art methods. Quantitative and qualitative evaluations confirm substantial gains in emotional accuracy, diversity, and visual expressiveness.

📝 Abstract
Image generation based on diffusion models has demonstrated impressive capability, motivating exploration into diverse and specialized applications. Owing to the importance of emotion in advertising, emotion-oriented image generation has attracted increasing attention. However, current emotion-oriented methods suffer from an affective shortcut, in which emotion is reduced to semantics. As two decades of research have shown, emotion is not equivalent to semantics. To this end, we propose Emotion-Director, a cross-modal collaboration framework consisting of two modules. First, we propose a cross-Modal Collaborative diffusion model, abbreviated as MC-Diffusion, which integrates visual prompts with textual prompts for guidance, enabling the generation of emotion-oriented images beyond semantics. We further improve DPO optimization with a negative visual prompt, enhancing the model's sensitivity to different emotions under the same semantics. Second, we propose MC-Agent, a cross-Modal Collaborative Agent system that rewrites textual prompts to express the intended emotions. To avoid template-like rewrites, MC-Agent employs multiple agents to simulate human subjectivity toward emotion, and adopts a chain-of-concept workflow that improves the visual expressiveness of the rewritten prompts. Extensive qualitative and quantitative experiments demonstrate the superiority of Emotion-Director in emotion-oriented image generation.
Problem

Research questions and friction points this paper is trying to address.

Addresses affective shortcut in emotion-oriented image generation
Proposes cross-modal framework for emotion beyond semantics
Enhances visual expressiveness of emotional prompts via multi-agent system
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-modal collaborative diffusion model integrates visual and textual prompts
Improved DPO optimization with negative visual prompt enhances emotion sensitivity
Multi-agent system rewrites prompts using chain-of-concept workflow for expressiveness
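Neither the summary nor the abstract states the modified objective explicitly. Assuming the method starts from the standard Diffusion-DPO preference loss, one plausible reading is that the negative visual prompt conditions the dispreferred branch; the notation below (textual prompt c, positive/negative visual prompts v⁺/v⁻, preferred/dispreferred images x_w/x_l) is illustrative, not taken from the paper:

```latex
\mathcal{L}_{\text{DPO}} \;=\; -\,\mathbb{E}_{(x_w,\,x_l,\,c)}\!\left[
  \log \sigma\!\left(
    \beta \log \frac{p_\theta(x_w \mid c,\, v^{+})}{p_{\text{ref}}(x_w \mid c,\, v^{+})}
    \;-\;
    \beta \log \frac{p_\theta(x_l \mid c,\, v^{-})}{p_{\text{ref}}(x_l \mid c,\, v^{-})}
  \right)
\right]
```

Under this reading, the negative visual prompt v⁻ pushes the dispreferred branch toward a different emotion under the same semantics c, which would match the stated goal of emotion-visual disentanglement.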
👥 Authors
Guoli Jia, Tsinghua University
Junyao Hu, The Hong Kong Polytechnic University
Xinwei Long, Tsinghua University (natural language processing, multi-modal learning)
Kai Tian, Tsinghua University, Frontis.AI
Kaiyan Zhang, Tsinghua University (Foundation Model, Collective Intelligence, Scientific Intelligence)
KaiKai Zhao, Tsinghua University, China Unicom
Ning Ding, Tsinghua University
Bowen Zhou, Tsinghua University, Shanghai Artificial Intelligence Lab