Moodifier: MLLM-Enhanced Emotion-Driven Image Editing

📅 2025-07-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses core challenges in emotion-driven image editing: the difficulty of mapping abstract emotions to concrete visual attributes, poor cross-domain generalization, and low content fidelity. To this end, the authors introduce MoodArchive, a large-scale dataset of over 8M images with detailed hierarchical emotional annotations generated by LLaVA and partially validated by human evaluators. They further propose MoodifyCLIP, a vision-language model fine-tuned on MoodArchive via contrastive learning to translate abstract emotions into fine-grained visual attributes (e.g., color, texture, pose). Building on MoodifyCLIP and multimodal large language models (MLLMs), they design Moodifier, a training-free, context-aware editing framework that preserves content integrity. Evaluation across diverse domains, including character expressions, fashion design, jewelry, and interior styling, shows gains over existing methods in emotion accuracy (+12.3%) and content fidelity (+9.7%). The code, models, and MoodArchive dataset will be publicly released.

📝 Abstract
Bridging emotions and visual content for emotion-driven image editing holds great potential in creative industries, yet precise manipulation remains challenging due to the abstract nature of emotions and their varied manifestations across different contexts. We tackle this challenge with an integrated approach consisting of three complementary components. First, we introduce MoodArchive, an 8M+ image dataset with detailed hierarchical emotional annotations generated by LLaVA and partially validated by human evaluators. Second, we develop MoodifyCLIP, a vision-language model fine-tuned on MoodArchive to translate abstract emotions into specific visual attributes. Third, we propose Moodifier, a training-free editing model leveraging MoodifyCLIP and multimodal large language models (MLLMs) to enable precise emotional transformations while preserving content integrity. Our system works across diverse domains such as character expressions, fashion design, jewelry, and home décor, enabling creators to quickly visualize emotional variations while preserving identity and structure. Extensive experimental evaluations show that Moodifier outperforms existing methods in both emotional accuracy and content preservation, providing contextually appropriate edits. By linking abstract emotions to concrete visual changes, our solution unlocks new possibilities for emotional content creation in real-world applications. We will release the MoodArchive dataset and the MoodifyCLIP model, and make the Moodifier code and demo publicly available upon acceptance.
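
Neither the summary nor the abstract specifies MoodifyCLIP's objective beyond contrastive fine-tuning. For orientation only, the standard CLIP-style symmetric InfoNCE loss over a batch of N image–caption pairs, a plausible baseline for such fine-tuning rather than the paper's confirmed formulation, is:

```latex
\mathcal{L} = -\frac{1}{2N} \sum_{i=1}^{N} \Bigg[
    \log \frac{\exp(v_i^\top t_i / \tau)}{\sum_{j=1}^{N} \exp(v_i^\top t_j / \tau)}
  + \log \frac{\exp(v_i^\top t_i / \tau)}{\sum_{j=1}^{N} \exp(v_j^\top t_i / \tau)}
\Bigg]
```

Here v_i and t_i are the L2-normalized embeddings of the i-th image and its emotional caption, and τ is a learned temperature; under this reading, MoodArchive's hierarchical emotion annotations would supply the text side of each pair.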
Problem

Research questions and friction points this paper is trying to address.

Bridging emotions and visual content for precise image editing
Translating abstract emotions into specific visual attributes
Enabling emotional transformations while preserving content integrity
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLaVA-annotated MoodArchive dataset with 8M+ images
MoodifyCLIP model translating emotions to visual attributes
Training-free Moodifier model using MLLMs for precise edits (see the sketch after this list)
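
To make the training-free pipeline above concrete, the sketch below shows one plausible way the components could fit together: the MLLM grounds a target emotion in attribute-level edit instructions, an instruction-guided editor applies them, and MoodifyCLIP ranks candidates for emotion alignment and content preservation. All objects and method names (mllm.generate, editor.edit, clip.text_image_similarity, clip.image_image_similarity) are hypothetical interfaces for illustration, not the authors' released API.

```python
# A minimal, hypothetical sketch of a Moodifier-style training-free editing
# loop. The `mllm`, `editor`, and `clip` objects and their methods are
# assumed interfaces, not the paper's actual implementation.

from PIL import Image


def moodify(image: Image.Image, emotion: str, mllm, editor, clip,
            n_candidates: int = 4) -> Image.Image:
    # 1) Ground the abstract emotion in concrete, context-aware visual
    #    attributes (color, texture, pose, ...) using the MLLM.
    prompt = (
        f"List specific visual edits that would make this image convey "
        f"'{emotion}' while preserving its subject, identity, and structure."
    )
    instructions = mllm.generate(image=image, prompt=prompt)

    # 2) Apply the instructions with an instruction-guided editor (for
    #    example, an InstructPix2Pix-style diffusion model), sampling
    #    several candidate edits.
    candidates = [editor.edit(image, instructions) for _ in range(n_candidates)]

    # 3) Rank candidates with MoodifyCLIP: reward alignment with the target
    #    emotion while penalizing drift from the source image.
    def score(candidate: Image.Image) -> float:
        emotion_sim = clip.text_image_similarity(emotion, candidate)
        content_sim = clip.image_image_similarity(image, candidate)
        return emotion_sim + content_sim

    return max(candidates, key=score)
```

The candidate-sampling-plus-reranking step is one common way to trade compute for reliability in tuning-free editing; the paper may use a different selection mechanism.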