AI Summary
Current AI evaluation frameworks struggle to detect subtle risks in caregiving contexts, such as emotional neglect, bias, or inappropriate information in large language model (LLM) responses. This work proposes RubRIX, the first user-centered evaluation framework grounded in care ethics theory and validated by clinical experts, which translates ethical principles into five actionable risk dimensions and incorporates human-guided scoring rules to steer model refinement. Evaluations of six mainstream LLMs on over 20,000 real-world caregiver queries demonstrate that a single round of optimization guided by RubRIX reduces these risks by 45%–98%, substantially enhancing the safety and reliability of AI systems in high-stakes caregiving interactions.
Abstract
Caregivers seeking AI-mediated support express complex needs -- information-seeking, emotional validation, and distress cues -- that warrant careful evaluation of response safety and appropriateness. Existing AI evaluation frameworks, primarily focused on general risks (toxicity, hallucinations, policy violations, etc.), may not adequately capture the nuanced risks of LLM responses in caregiving contexts. We introduce RubRIX (Rubric-based Risk Index), a theory-driven, clinician-validated framework for evaluating risks in LLM caregiving responses. Grounded in the Elements of an Ethic of Care, RubRIX operationalizes five empirically derived risk dimensions: Inattention, Bias & Stigma, Information Inaccuracy, Uncritical Affirmation, and Epistemic Arrogance. We evaluate six state-of-the-art LLMs on over 20,000 caregiver queries from Reddit and ALZConnected. Rubric-guided refinement consistently reduced risk components by 45%–98% after one iteration across models. This work contributes a methodological approach for developing domain-sensitive, user-centered evaluation frameworks for high-burden contexts, and our findings highlight the importance of interactional risk evaluation for the responsible deployment of LLMs in caregiving support. We release benchmark datasets to enable future research on contextual risk evaluation in AI-mediated support.