🤖 AI Summary
Contemporary large language models (LLMs) exhibit significant deficiencies in deep affective intelligence tasks—such as emotion tracking, attribution inference, and contextually appropriate response generation—within multi-turn dialogues.
Method: We propose EICAP, the first psychology-grounded, four-layer affective intelligence taxonomy (Evaluation, Tracking, Attribution, Response), and introduce EICAP-Bench—a cross-cultural, multi-turn dialogue-oriented benchmark in multiple-choice format—to systematically expose limitations of current instruction-tuning data in affective reasoning. Using LoRA, we conduct English–Arabic bilingual fine-tuning on Qwen2.5 and perform ablation studies on UltraChat.
Contribution/Results: Statistical analysis reveals substantial improvement only in the Evaluation layer, with marginal gains in Tracking, Attribution, and Response layers. Among six open-source models evaluated, Qwen2.5-Instruct achieves the best overall performance. This work establishes a theoretical framework and empirical benchmark for fine-grained evaluation and targeted enhancement of LLMs’ affective intelligence.
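The fine-tuning step above relies on LoRA, which freezes the pretrained weights and learns a low-rank additive update. As a minimal sketch of the core computation (all shapes, values, and the rank/alpha settings here are illustrative assumptions, not the paper's actual Qwen2.5 configuration):

```python
# Minimal sketch of a LoRA forward pass: y = W x + (alpha / r) * B (A x),
# where W is the frozen pretrained weight and A (r x d_in), B (d_out x r)
# are the small trainable adapter matrices. Pure Python for clarity.

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

def lora_forward(W, A, B, x, alpha=16, r=2):
    """Frozen base projection plus scaled low-rank delta."""
    base = matvec(W, x)                # W x  (frozen)
    delta = matvec(B, matvec(A, x))    # B (A x)  (trainable, rank r)
    scale = alpha / r
    return [b + scale * d for b, d in zip(base, delta)]

# Toy example: 3x3 identity as the frozen weight, rank-2 adapters.
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
A = [[0.1, 0.0, 0.0],   # r x d_in
     [0.0, 0.1, 0.0]]
B = [[0.1, 0.0],        # d_out x r
     [0.0, 0.1],
     [0.0, 0.0]]
x = [1.0, 2.0, 3.0]
y = lora_forward(W, A, B, x)  # base [1, 2, 3] plus scaled low-rank delta
```

Because only A and B are updated, the number of trainable parameters scales with the rank r rather than with the full weight matrix, which is what makes bilingual fine-tuning of a 7B model tractable.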
📝 Abstract
Emotional Intelligence (EI) is a critical yet underexplored dimension in the development of human-aligned LLMs. To address this gap, we introduce a unified, psychologically grounded four-layer taxonomy of EI tailored for large language models (LLMs), encompassing emotional tracking, cause inference, appraisal, and emotionally appropriate response generation. Building on this framework, we present EICAP-Bench, a novel MCQ-style multi-turn benchmark designed to evaluate EI capabilities in open-source LLMs across diverse linguistic and cultural contexts. We evaluate six LLMs: LLaMA3 (8B), LLaMA3-Instruct, Gemma (9B), Gemma-Instruct, Qwen2.5 (7B), and Qwen2.5-Instruct on EICAP-Bench, identifying Qwen2.5-Instruct as the strongest baseline. To assess the potential for enhancing EI capabilities, we fine-tune both Qwen2.5-Base and Qwen2.5-Instruct using LoRA adapters on UltraChat (UC), a large-scale, instruction-tuned dialogue dataset, in both English and Arabic. Our statistical analysis reveals that among the four EI layers, only the Appraisal layer shows significant improvement through UC-based fine-tuning. These findings highlight the limitations of existing pretraining and instruction-tuning paradigms in equipping LLMs with deeper emotional reasoning and underscore the need for targeted data and modeling strategies for comprehensive EI alignment.
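An MCQ-style benchmark with layered questions implies per-layer scoring rather than a single aggregate accuracy. A hedged sketch of how such scoring might work (the item format and layer names follow the summary's taxonomy; none of this is EICAP-Bench's actual data schema):

```python
# Sketch of per-layer accuracy scoring for an MCQ benchmark whose items
# are tagged with one of four EI layers. Item/prediction formats are
# assumed for illustration.
from collections import defaultdict

def per_layer_accuracy(items, predictions):
    """items: dicts with 'layer' and gold 'answer'; predictions: chosen options."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for item, pred in zip(items, predictions):
        total[item["layer"]] += 1
        if pred == item["answer"]:
            correct[item["layer"]] += 1
    return {layer: correct[layer] / total[layer] for layer in total}

items = [
    {"layer": "Evaluation",  "answer": "B"},
    {"layer": "Tracking",    "answer": "A"},
    {"layer": "Attribution", "answer": "C"},
    {"layer": "Response",    "answer": "D"},
]
preds = ["B", "A", "A", "D"]  # one Attribution error
acc = per_layer_accuracy(items, preds)
```

Reporting accuracy per layer is what lets the analysis isolate the finding that only one layer improves under UltraChat fine-tuning while the others stay flat.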