SEER: The Span-based Emotion Evidence Retrieval Benchmark

📅 2025-10-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper introduces the span-level emotion evidence detection task, which aims to precisely localize the textual segments that convey emotion, in contrast with conventional sentence-level emotion classification, thereby supporting applications that require fine-grained emotional understanding, such as empathetic dialogue systems and clinical decision support. To this end, the authors construct the first manually annotated, multi-level benchmark (span-level labels for both single sentences and five-sentence paragraphs) and systematically evaluate 14 open-source large language models. The key contribution is reframing emotion analysis from *discriminating emotion categories* to *localizing emotion-supporting evidence*, advancing model interpretability and the mechanistic understanding of how emotion is expressed. Experiments reveal that while some models approach human performance on single-sentence detection, their accuracy degrades markedly on longer contexts, exposing critical limitations, including keyword dependency and high false-positive rates on neutral text. The benchmark establishes a novel evaluation paradigm and foundational infrastructure for explainable affective computing.

📝 Abstract
We introduce the SEER (Span-based Emotion Evidence Retrieval) Benchmark to test Large Language Models' (LLMs) ability to identify the specific spans of text that express emotion. Unlike traditional emotion recognition tasks that assign a single label to an entire sentence, SEER targets the underexplored task of emotion evidence detection: pinpointing which exact phrases convey emotion. This span-level approach is crucial for applications like empathetic dialogue and clinical support, which need to know how emotion is expressed, not just what the emotion is. SEER includes two tasks: identifying emotion evidence within a single sentence, and identifying evidence across a short passage of five consecutive sentences. It contains new annotations for both emotion and emotion evidence on 1200 real-world sentences. We evaluate 14 open-source LLMs and find that, while some models approach average human performance on single-sentence inputs, their accuracy degrades in longer passages. Our error analysis reveals key failure modes, including overreliance on emotion keywords and false positives in neutral text.
Problem

Research questions and friction points this paper is trying to address.

Identifying the exact text spans that express emotion within a single sentence
Detecting emotion evidence across multi-sentence passages (five consecutive sentences)
Evaluating where and why LLMs fail at emotion evidence retrieval
Innovation

Methods, ideas, or system contributions that make the work stand out.

First manually annotated benchmark of emotion and emotion-evidence labels on 1200 real-world sentences
Paired single-sentence and five-sentence passage tasks that isolate the effect of context length
Systematic evaluation of 14 open-source LLMs, with error analysis of keyword overreliance and neutral-text false positives
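Since the paper scores models on how well predicted evidence spans match human annotations, a natural way to make this concrete is an overlap-based F1 between predicted and gold character spans. The paper's exact metric is not specified here, so the function below, the example sentence, and its offsets are all illustrative assumptions, not the authors' implementation:

```python
def span_f1(pred_spans, gold_spans):
    """Character-overlap F1 between predicted and gold evidence spans.

    Spans are (start, end) character offsets, end exclusive.
    Returns 1.0 when both sets are empty (e.g. a neutral sentence
    where the model correctly predicts no evidence).
    """
    pred_chars = {i for s, e in pred_spans for i in range(s, e)}
    gold_chars = {i for s, e in gold_spans for i in range(s, e)}
    if not pred_chars and not gold_chars:
        return 1.0
    if not pred_chars or not gold_chars:
        return 0.0
    overlap = len(pred_chars & gold_chars)
    precision = overlap / len(pred_chars)
    recall = overlap / len(gold_chars)
    return 2 * precision * recall / (precision + recall)

# Hypothetical example: the model recovers one of two gold evidence spans.
sentence = "I was thrilled to see her, though the traffic was awful."
gold = [(6, 14), (46, 55)]   # "thrilled", "was awful"
pred = [(6, 14)]             # model finds only the first span
print(round(span_f1(pred, gold), 2))  # → 0.64
```

A character-level (rather than exact-match) score gives partial credit for near-miss boundaries, which matters because annotators themselves often disagree on exact span edges.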