🤖 AI Summary
Problem: Evaluating LLMs’ sentiment understanding in low-resource, culturally complex settings remains challenging, particularly for the informal, Swahili-English code-mixed messages used by youth health communities on WhatsApp in Nairobi, where sentiment is context-dependent and culturally embedded.
Method: We propose the first diagnostic evaluation framework for social science measurement, integrating counterfactual sentiment flipping, calibrated interpretability assessment, and expert-annotated data to systematically probe model performance on ambiguity resolution, cross-contextual sentiment transfer, and reasoning consistency.
Contribution/Results: Top-tier closed-source models show strong robustness, whereas open-source models degrade significantly under cultural ambiguity and contextual shifts. This work pioneers the integration of measurement validity principles into LLM sentiment evaluation, establishing both theoretical grounding and methodological foundations for developing culturally adaptive, traceable, and inference-aware AI assessment systems.
📝 Abstract
Sentiment analysis in low-resource, culturally nuanced contexts challenges conventional NLP approaches that assume fixed labels and universal affective expressions. We present a diagnostic framework that treats sentiment as a context-dependent, culturally embedded construct, and evaluate how large language models (LLMs) reason about sentiment in informal, code-mixed WhatsApp messages from Nairobi youth health groups. Using a combination of human-annotated data, sentiment-flipped counterfactuals, and rubric-based explanation evaluation, we probe LLM interpretability, robustness, and alignment with human reasoning. Framing our evaluation through a social-science measurement lens, we operationalize and interrogate LLM outputs as an instrument for measuring the abstract concept of sentiment. Our findings reveal significant variation in model reasoning quality: top-tier LLMs demonstrate interpretive stability, while open models often falter under ambiguity or sentiment shifts. This work highlights the need for culturally sensitive, reasoning-aware AI evaluation in complex, real-world communication.
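To make the probing procedure concrete, here is a minimal, hypothetical Python sketch of one counterfactual consistency check as described in the abstract: query a model on an expert-annotated message and on its sentiment-flipped counterfactual, then check label correctness, flip consistency, and explanation quality. The `query_llm` and `rubric_score` functions and the `Example` fields are illustrative placeholders, not the paper's actual implementation.

```python
from dataclasses import dataclass


@dataclass
class Example:
    text: str            # original code-mixed message (hypothetical sample)
    gold_label: str      # expert-annotated sentiment: "positive" | "negative" | "neutral"
    counterfactual: str  # minimally edited, sentiment-flipped version of the message
    flipped_label: str   # sentiment expected after the flip


def query_llm(message: str) -> tuple[str, str]:
    """Placeholder for an LLM call returning (sentiment_label, explanation).

    Swap in the API client of your choice; the probe itself is model-agnostic.
    """
    raise NotImplementedError


def rubric_score(explanation: str) -> float:
    """Placeholder rubric check, e.g. does the explanation cite the cultural or
    contextual cue, stay consistent with the predicted label, and avoid
    hallucinated content? Returns a score in [0, 1]; in practice this judgment
    would come from human raters applying the rubric.
    """
    raise NotImplementedError


def probe(example: Example) -> dict:
    """Run the counterfactual consistency probe on a single annotated example."""
    orig_label, orig_expl = query_llm(example.text)
    cf_label, cf_expl = query_llm(example.counterfactual)
    return {
        "correct_on_original": orig_label == example.gold_label,
        "flips_with_counterfactual": cf_label == example.flipped_label,
        "explanation_quality": (rubric_score(orig_expl) + rubric_score(cf_expl)) / 2,
    }
```

Aggregating these per-example records across models is what surfaces the robustness gap the abstract reports: a model with interpretive stability keeps `flips_with_counterfactual` high and its explanation scores consistent, while a brittle model passes the original examples but breaks when the sentiment-bearing cue is minimally edited.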