MUSE: Manipulating Unified Framework for Synthesizing Emotions in Images via Test-Time Optimization

📅 2025-11-25
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing image emotion synthesis (IES) methods treat generation and editing as disjoint tasks, limiting their applicability in therapeutic interventions and narrative creation—scenarios requiring tight integration of both. This paper introduces MUSE, the first unified framework for image emotion generation and editing, enabling end-to-end emotional control at test time without fine-tuning or task-specific data. Its core contributions are: (1) leveraging a pre-trained emotion classifier as supervision, augmented by dynamic semantic similarity assessment to determine optimal emotion guidance timing; (2) introducing a multi-emotion loss function that explicitly suppresses interference from semantically similar emotions; and (3) establishing a novel paradigm balancing output stability, timing sensitivity, and content fidelity. Extensive experiments demonstrate that MUSE consistently outperforms state-of-the-art methods across emotion accuracy, semantic diversity, and text-image alignment.

📝 Abstract
Images evoke emotions that profoundly influence perception, often prioritized over content. Current Image Emotional Synthesis (IES) approaches artificially separate generation and editing tasks, creating inefficiencies and limiting applications where these tasks naturally intertwine, such as therapeutic interventions or storytelling. In this work, we introduce MUSE, the first unified framework capable of both emotional generation and editing. By adopting a strategy conceptually aligned with Test-Time Scaling (TTS), which is widely used in both the LLM and diffusion model communities, MUSE avoids the need to further update the diffusion model or to collect specialized emotional synthesis datasets. More specifically, MUSE addresses three key questions in emotional synthesis: (1) HOW to stably guide synthesis by leveraging an off-the-shelf emotion classifier with gradient-based optimization of emotional tokens; (2) WHEN to introduce emotional guidance by identifying the optimal timing using semantic similarity as a supervisory signal; and (3) WHICH emotion to guide synthesis toward, via a multi-emotion loss that reduces interference from inherent and semantically similar emotions. Experimental results show that MUSE performs favorably against all compared methods for both generation and editing, improving emotional accuracy and semantic diversity while maintaining an optimal balance between desired content, adherence to text prompts, and realistic emotional expression. It establishes a new paradigm for emotion synthesis.
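The "HOW" component, gradient-based optimization of emotional tokens against a frozen emotion classifier, can be illustrated with a toy sketch. Everything here is illustrative, not the paper's implementation: the classifier is a random linear-softmax model, the "emotional token" is a plain embedding vector, and the learning rate and step count are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, N_EMOTIONS = 8, 4                    # embedding size, emotion classes
W = rng.normal(size=(N_EMOTIONS, DIM))    # frozen stand-in "emotion classifier"

def softmax(z):
    z = z - z.max()                       # numerical stability
    e = np.exp(z)
    return e / e.sum()

def optimize_emotion_token(target, steps=200, lr=0.5):
    """Gradient-descend a token embedding so the frozen classifier
    assigns high probability to the target emotion."""
    e = rng.normal(size=DIM) * 0.01       # emotional token embedding
    onehot = np.eye(N_EMOTIONS)[target]
    for _ in range(steps):
        p = softmax(W @ e)
        # gradient of cross-entropy -log p[target] w.r.t. the embedding
        grad = W.T @ (p - onehot)
        e -= lr * grad
    return e, softmax(W @ e)

token, probs = optimize_emotion_token(target=2)
print(probs.argmax())                     # classifier now prefers the target
```

The key design point mirrored here is that only the token embedding is updated at test time; the classifier (and, in MUSE, the diffusion model) stays frozen.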
Problem

Research questions and friction points this paper is trying to address.

Unifies emotional image generation and editing tasks
Eliminates need for specialized datasets and model updates
Optimizes emotion guidance timing and reduces interference
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified framework for emotional generation and editing
Leverages test-time optimization with emotion classifier
Employs multi-emotion loss to reduce interference
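The multi-emotion loss in the last bullet can be sketched as a plain cross-entropy term plus a penalty on probability mass assigned to semantically similar emotions. The exact formulation in the paper may differ; this form, the `lam` weight, and the example logits are assumptions for illustration.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def multi_emotion_loss(logits, target, similar, lam=0.5):
    """Hypothetical multi-emotion loss: cross-entropy toward the target
    emotion plus a term suppressing semantically similar emotions
    (e.g. 'excitement' interfering when the target is 'amusement')."""
    p = softmax(logits)
    ce = -np.log(p[target])
    # penalize probability mass on the interfering (similar) emotions
    suppress = -np.log(1.0 - p[list(similar)].sum() + 1e-8)
    return ce + lam * suppress

logits = np.array([2.0, 1.8, 0.1, -1.0])  # class 1 is "similar" to target 0
plain = -np.log(softmax(logits)[0])
combined = multi_emotion_loss(logits, target=0, similar=[1])
print(combined > plain)                   # suppression adds a positive penalty
```

Minimizing this loss pushes the synthesis not only toward the target emotion but actively away from its near neighbors, which is how the interference mentioned above would be reduced.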