🤖 AI Summary
Existing emotion recognition in conversation (ERC) approaches predominantly rely on multi-stage instruction tuning, which struggles to jointly model the dynamic interaction between speaker characteristics and conversational context, resulting in weak alignment among speaker identity, context, and emotion. To address this, we propose InitERC, a single-stage in-context instruction tuning framework that unifies the modeling of speakers, context, and emotions for end-to-end alignment. Its core components are: (i) adaptive selection of in-context examples from a curated demonstration pool; (ii) a structured prompt template that explicitly encodes speaker-context-emotion relationships; and (iii) an in-context instruction tuning mechanism through which the model learns speaker-context-emotion alignment directly from context examples. Evaluated on three benchmark datasets (MELD, EmoryNLP, and IEMOCAP), InitERC consistently outperforms state-of-the-art methods, demonstrating that both the quality and the organization of contextual demonstrations substantially affect ERC performance.
📝 Abstract
Emotion recognition in conversation (ERC) aims to identify the emotion of each utterance in a conversation and plays a vital role in empathetic artificial intelligence. With the rapid growth of large language models (LLMs), instruction tuning has emerged as a critical paradigm for ERC. Existing studies mainly focus on multi-stage instruction tuning, which first endows LLMs with speaker characteristics and then conducts context-aware instruction tuning to comprehend emotional states. However, these methods inherently constrain the capacity to jointly capture the dynamic interaction between speaker characteristics and conversational context, resulting in weak alignment among speaker identity, contextual cues, and emotional states within a unified framework. In this paper, we propose InitERC, a simple yet effective one-stage in-context instruction tuning framework for ERC. InitERC adapts LLMs to learn speaker-context-emotion alignment from context examples via in-context instruction tuning. Specifically, InitERC comprises four components: demonstration pool construction, in-context example selection, prompt template design, and in-context instruction tuning. To explore the impact of in-context examples, we conduct a comprehensive study of three key factors: retrieval strategy, example ordering, and the number of examples. Extensive experiments on three widely used datasets demonstrate that InitERC achieves substantial improvements over state-of-the-art baselines.
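To make the pipeline concrete, the sketch below illustrates one plausible reading of the in-context example selection and prompt assembly steps: retrieve the k demonstrations most similar to the target utterance (here by cosine similarity over precomputed embeddings, a common retrieval strategy; the abstract does not specify which one InitERC uses), order them, and fill a speaker-aware prompt template. All function names, the template wording, and the ordering choice are illustrative assumptions, not the paper's actual implementation.

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors (lists of floats).
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def select_demonstrations(query_emb, pool, k=3, most_similar_last=True):
    """Pick the k pool entries most similar to the query and order them.

    pool: list of (embedding, demonstration_text) pairs.
    most_similar_last places the closest example nearest the target
    utterance in the prompt -- one possible ordering; the paper studies
    example ordering as a separate factor.
    """
    ranked = sorted(pool, key=lambda p: cosine(query_emb, p[0]), reverse=True)
    top_k = ranked[:k]
    if most_similar_last:
        top_k = top_k[::-1]
    return [text for _, text in top_k]

def build_prompt(demos, speaker, utterance):
    # Hypothetical template encoding the speaker-context-emotion structure.
    header = "Identify the emotion of the target utterance.\n\n"
    body = "\n".join(demos)
    target = f"\nSpeaker {speaker}: {utterance}\nEmotion:"
    return header + body + target
```

During one-stage training, each instruction-tuning instance would be a prompt assembled this way, so the model sees speaker, context, and emotion jointly rather than in separate stages.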