🤖 AI Summary
Contemporary large language models (LLMs) exhibit significant deficiencies in deep affective intelligence tasks—such as emotion tracking, attribution inference, and contextually appropriate response generation—within multi-turn dialogues.
Method: We propose EICAP, the first psychology-grounded, four-layer affective intelligence taxonomy (Evaluation, Tracking, Attribution, Response), and introduce EICAP-Bench—a cross-cultural, multi-turn dialogue-oriented benchmark in multiple-choice format—to systematically expose limitations of current instruction-tuning data in affective reasoning. Using LoRA, we conduct English–Arabic bilingual fine-tuning on Qwen2.5 and perform ablation studies on UltraChat.
Contribution/Results: Statistical analysis reveals substantial improvement only in the Evaluation layer, with marginal gains in Tracking, Attribution, and Response layers. Among six open-source models evaluated, Qwen2.5-Instruct achieves the best overall performance. This work establishes a theoretical framework and empirical benchmark for fine-grained evaluation and targeted enhancement of LLMs’ affective intelligence.
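The fine-tuning step above relies on LoRA, which freezes the pretrained weights and learns a low-rank additive update. As a minimal sketch of the core computation (all shapes, values, and the rank/alpha settings here are illustrative assumptions, not the paper's actual Qwen2.5 configuration):

```python
# Minimal sketch of a LoRA forward pass: y = W x + (alpha / r) * B (A x),
# where W is the frozen pretrained weight and A (r x d_in), B (d_out x r)
# are the small trainable adapter matrices. Pure Python for clarity.

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

def lora_forward(W, A, B, x, alpha=16, r=2):
    """Frozen base projection plus scaled low-rank delta."""
    base = matvec(W, x)                # W x  (frozen)
    delta = matvec(B, matvec(A, x))    # B (A x)  (trainable, rank r)
    scale = alpha / r
    return [b + scale * d for b, d in zip(base, delta)]

# Toy example: 3x3 identity as the frozen weight, rank-2 adapters.
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
A = [[0.1, 0.0, 0.0],   # r x d_in
     [0.0, 0.1, 0.0]]
B = [[0.1, 0.0],        # d_out x r
     [0.0, 0.1],
     [0.0, 0.0]]
x = [1.0, 2.0, 3.0]
y = lora_forward(W, A, B, x)  # base [1, 2, 3] plus scaled low-rank delta
```

Because only A and B are updated, the number of trainable parameters scales with the rank r rather than with the full weight matrix, which is what makes bilingual fine-tuning of a 7B model tractable.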
📝 Abstract
Emotional Intelligence (EI) is a critical yet underexplored dimension in the development of human-aligned LLMs. To address this gap, we introduce a unified, psychologically grounded four-layer taxonomy of EI tailored for large language models (LLMs), encompassing emotional tracking, cause inference, appraisal, and emotionally appropriate response generation. Building on this framework, we present EICAP-Bench, a novel MCQ-style multi-turn benchmark designed to evaluate EI capabilities in open-source LLMs across diverse linguistic and cultural contexts. We evaluate six LLMs: LLaMA3 (8B), LLaMA3-Instruct, Gemma (9B), Gemma-Instruct, Qwen2.5 (7B), and Qwen2.5-Instruct on EICAP-Bench, identifying Qwen2.5-Instruct as the strongest baseline. To assess the potential for enhancing EI capabilities, we fine-tune both Qwen2.5-Base and Qwen2.5-Instruct using LoRA adapters on UltraChat (UC), a large-scale, instruction-tuned dialogue dataset, in both English and Arabic. Our statistical analysis reveals that among the four EI layers, only the Appraisal layer shows significant improvement through UC-based fine-tuning. These findings highlight the limitations of existing pretraining and instruction-tuning paradigms in equipping LLMs with deeper emotional reasoning and underscore the need for targeted data and modeling strategies for comprehensive EI alignment.
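An MCQ-style benchmark with layered questions implies per-layer scoring rather than a single aggregate accuracy. A hedged sketch of how such scoring might work (the item format and layer names follow the summary's taxonomy; none of this is EICAP-Bench's actual data schema):

```python
# Sketch of per-layer accuracy scoring for an MCQ benchmark whose items
# are tagged with one of four EI layers. Item/prediction formats are
# assumed for illustration.
from collections import defaultdict

def per_layer_accuracy(items, predictions):
    """items: dicts with 'layer' and gold 'answer'; predictions: chosen options."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for item, pred in zip(items, predictions):
        total[item["layer"]] += 1
        if pred == item["answer"]:
            correct[item["layer"]] += 1
    return {layer: correct[layer] / total[layer] for layer in total}

items = [
    {"layer": "Evaluation",  "answer": "B"},
    {"layer": "Tracking",    "answer": "A"},
    {"layer": "Attribution", "answer": "C"},
    {"layer": "Response",    "answer": "D"},
]
preds = ["B", "A", "A", "D"]  # one Attribution error
acc = per_layer_accuracy(items, preds)
```

Reporting accuracy per layer is what lets the analysis isolate the finding that only one layer improves under UltraChat fine-tuning while the others stay flat.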