Discrete Prompt Tuning via Recursive Utilization of Black-box Multimodal Large Language Model for Personalized Visual Emotion Recognition

📅 2025-08-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current multimodal large language models (MLLMs) struggle with personalized visual emotion recognition (VER) due to training data biases toward population-level consensus, limiting their ability to capture individual subjective affective differences. To address this, we propose a recursive discrete prompt tuning method for black-box MLLMs: inspired by human prompt engineering, it treats natural language prompts as learnable discrete parameters, dynamically adapting to user-specific affective cognition through iterative prompt generation, semantic filtering, and refinement—without accessing model internals or gradients. The method relies solely on input-output interactions. Experiments demonstrate significant improvements in per-user emotion prediction accuracy. This work is the first to validate the efficacy of pure prompt tuning for customized perceptual tasks, establishing a new paradigm for privacy-preserving, low-resource personalized multimodal understanding.
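The loop described above (generate candidate prompts, score them against a user's labelled examples via input-output queries only, keep the best, refine) can be sketched as follows. This is a minimal illustration, not the paper's implementation: `query_mllm`, `generate_candidates`, and the toy user labels are all hypothetical stand-ins, with the black-box MLLM mocked by a deterministic scorer.

```python
# Hedged sketch of recursive discrete prompt tuning for personalized VER.
# All names are illustrative; the real method queries an actual black-box MLLM.

EMOTIONS = ["joy", "sadness", "anger", "fear"]

# Per-user labelled examples used to score candidate prompts.
USER_LABELS = {"img1": "sadness", "img2": "fear", "img3": "sadness"}

def query_mllm(prompt: str, image_id: str) -> str:
    """Mock black-box call: predicts an emotion from (prompt, image).

    Toy behavior: prompts mentioning "this user" match the user's labels,
    mimicking a model nudged away from the population-level default.
    """
    if "this user" in prompt:
        return USER_LABELS[image_id]
    return "joy"  # population-level consensus answer

def accuracy(prompt: str) -> float:
    """Score a prompt by per-user prediction accuracy (input-output only)."""
    hits = sum(query_mllm(prompt, img) == lab for img, lab in USER_LABELS.items())
    return hits / len(USER_LABELS)

def generate_candidates(best_prompt: str) -> list[str]:
    """Stand-in for asking the MLLM to paraphrase/refine the current prompt."""
    return [
        best_prompt + " Consider how this user tends to feel.",
        best_prompt.replace("the viewer", "this user"),
        "Predict the emotion this user feels about the image.",
    ]

def tune(seed_prompt: str, rounds: int = 3) -> str:
    """Recursively select the best natural-language prompt as the 'parameter'."""
    best, best_acc = seed_prompt, accuracy(seed_prompt)
    for _ in range(rounds):
        for cand in generate_candidates(best):
            acc = accuracy(cand)
            if acc > best_acc:  # semantic filtering: keep only improvements
                best, best_acc = cand, acc
    return best

best = tune("Predict the emotion the viewer feels about the image.")
print(best, accuracy(best))
```

The key property the sketch preserves is that no gradients or model internals are touched: the prompt itself is the learnable discrete parameter, updated purely from query-response feedback.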

📝 Abstract
Visual Emotion Recognition (VER) is an important research topic due to its wide range of applications, including opinion mining and advertisement design. Extending this capability to recognize emotions at the individual level further broadens its potential applications. Recently, Multimodal Large Language Models (MLLMs) have attracted increasing attention and demonstrated performance comparable to that of conventional VER methods. However, MLLMs are trained on large and diverse datasets containing general opinions, which causes them to favor majority viewpoints and familiar patterns. This tendency limits their performance in personalized VER, which is crucial for practical, real-world applications, and indicates a key area for improvement. To address this limitation, the proposed method employs discrete prompt tuning, inspired by the human process of prompt engineering, to adapt the VER task to each individual. Our method selects the best natural language representation from the generated prompts and uses it to update the prompt, realizing accurate personalized VER.
Problem

Research questions and friction points this paper is trying to address.

Adapting MLLMs for personalized visual emotion recognition
Overcoming bias towards majority viewpoints in MLLMs
Enhancing individual-level emotion recognition accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Discrete prompt tuning for personalization
Recursive black-box MLLM utilization
Natural language representation selection
🔎 Similar Papers
2024-05-14 · IEEE/RSJ International Conference on Intelligent Robots and Systems · Citations: 2