🤖 AI Summary
Problem: Evaluating LLMs’ sentiment understanding in low-resource, culturally complex settings remains challenging, particularly for the informal, Swahili-English code-mixed messages used by youth health communities on WhatsApp in Nairobi, where sentiment is context-dependent and culturally embedded.
Method: We propose the first diagnostic evaluation framework for social science measurement, integrating counterfactual sentiment flipping, calibrated interpretability assessment, and expert-annotated data to systematically probe model performance on ambiguity resolution, cross-contextual sentiment transfer, and reasoning consistency.
Contribution/Results: Top-tier closed-source models show strong robustness, whereas open-source models degrade significantly under cultural ambiguity and contextual shifts. This work pioneers the integration of measurement validity principles into LLM sentiment evaluation, establishing both theoretical grounding and methodological foundations for developing culturally adaptive, traceable, and inference-aware AI assessment systems.
📝 Abstract
Sentiment analysis in low-resource, culturally nuanced contexts challenges conventional NLP approaches that assume fixed labels and universal affective expressions. We present a diagnostic framework that treats sentiment as a context-dependent, culturally embedded construct, and evaluate how large language models (LLMs) reason about sentiment in informal, code-mixed WhatsApp messages from Nairobi youth health groups. Using a combination of human-annotated data, sentiment-flipped counterfactuals, and rubric-based explanation evaluation, we probe LLM interpretability, robustness, and alignment with human reasoning. Framing our evaluation through a social-science measurement lens, we operationalize and interrogate LLM outputs as an instrument for measuring the abstract concept of sentiment. Our findings reveal significant variation in model reasoning quality: top-tier LLMs demonstrate interpretive stability, while open models often falter under ambiguity or sentiment shifts. This work highlights the need for culturally sensitive, reasoning-aware AI evaluation in complex, real-world communication.
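To make the probing procedure concrete, here is a minimal, hypothetical Python sketch of one counterfactual consistency check as described in the abstract: query a model on an expert-annotated message and on its sentiment-flipped counterfactual, then check label correctness, flip consistency, and explanation quality. The `query_llm` and `rubric_score` functions and the `Example` fields are illustrative placeholders, not the paper's actual implementation.

```python
from dataclasses import dataclass


@dataclass
class Example:
    text: str            # original code-mixed message (hypothetical sample)
    gold_label: str      # expert-annotated sentiment: "positive" | "negative" | "neutral"
    counterfactual: str  # minimally edited, sentiment-flipped version of the message
    flipped_label: str   # sentiment expected after the flip


def query_llm(message: str) -> tuple[str, str]:
    """Placeholder for an LLM call returning (sentiment_label, explanation).

    Swap in the API client of your choice; the probe itself is model-agnostic.
    """
    raise NotImplementedError


def rubric_score(explanation: str) -> float:
    """Placeholder rubric check, e.g. does the explanation cite the cultural or
    contextual cue, stay consistent with the predicted label, and avoid
    hallucinated content? Returns a score in [0, 1]; in practice this judgment
    would come from human raters applying the rubric.
    """
    raise NotImplementedError


def probe(example: Example) -> dict:
    """Run the counterfactual consistency probe on a single annotated example."""
    orig_label, orig_expl = query_llm(example.text)
    cf_label, cf_expl = query_llm(example.counterfactual)
    return {
        "correct_on_original": orig_label == example.gold_label,
        "flips_with_counterfactual": cf_label == example.flipped_label,
        "explanation_quality": (rubric_score(orig_expl) + rubric_score(cf_expl)) / 2,
    }
```

Aggregating these per-example records across models is what surfaces the robustness gap the abstract reports: a model with interpretive stability keeps `flips_with_counterfactual` high and its explanation scores consistent, while a brittle model passes the original examples but breaks when the sentiment-bearing cue is minimally edited.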