🤖 AI Summary
Existing fairness evaluation methods for LLM-based recommender systems inadequately address dual-dimensional bias, spanning both psychological traits (e.g., the Big Five personality dimensions) and sensitive demographic attributes (eight categories in total, including gender, race, and age).
Method: We propose FairEval, the first user-level fairness assessment framework to integrate psychological and sociodemographic attributes, together with its fairness metric PAFS (Personality-Aware Fairness Score). FairEval introduces the Big Five personality traits into fairness modeling, uses prompt engineering with ChatGPT-4o and Gemini-1.5-Flash, runs controlled comparative recommendation experiments, and combines statistical bias analysis with personality-embedding modeling for fine-grained bias quantification.
Contribution/Results: PAFS reaches scores of up to 0.9969 (ChatGPT-4o) and 0.9997 (Gemini-1.5-Flash), while the framework detects inter-group recommendation disparities of up to 34.79%, significantly surpassing conventional demographic-only approaches. The results empirically show that prompt design critically influences fairness outcomes, establishing an evaluation paradigm beyond traditional demographic-centric assessment.
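The exact PAFS formula is not given in this summary; as a hedged illustration only, a user-level fairness score of this kind can be sketched as one minus the normalized gap between the best- and worst-served user groups. The `pafs` helper and group labels below are hypothetical, not the paper's definition:

```python
def pafs(group_scores):
    """Hypothetical personality-aware fairness score (illustrative sketch,
    not the paper's formula): 1 minus the normalized gap between the
    best- and worst-served groups. group_scores maps a
    (demographic, personality) group to its mean recommendation-quality
    score; 1.0 means all groups are served equally."""
    values = list(group_scores.values())
    best, worst = max(values), min(values)
    if best == 0:
        return 1.0  # no recommendation signal at all; treat as fair
    return 1.0 - (best - worst) / best

# A 34.79% relative disparity between groups would yield a score of 0.6521,
# while near-equal treatment yields scores close to 1 (as reported above).
scores = {("female", "high-openness"): 1.0,
          ("male", "low-openness"): 0.6521}
print(round(pafs(scores), 4))  # 0.6521
```

Under this sketch, a score near 1 (e.g., 0.9969 or 0.9997) indicates near-identical treatment across groups, matching the interpretation of the reported values.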
📝 Abstract
Recent advances in Large Language Models (LLMs) have enabled their application to recommender systems (RecLLMs), yet concerns remain regarding fairness across demographic and psychological user dimensions. We introduce FairEval, a novel evaluation framework to systematically assess fairness in LLM-based recommendations. FairEval integrates personality traits with eight sensitive demographic attributes, including gender, race, and age, enabling a comprehensive assessment of user-level bias. We evaluate models, including ChatGPT 4o and Gemini 1.5 Flash, on music and movie recommendations. FairEval's fairness metric, PAFS, achieves scores up to 0.9969 for ChatGPT 4o and 0.9997 for Gemini 1.5 Flash, with disparities reaching 34.79 percent. These results highlight the importance of robustness to prompt variations and support the development of more inclusive recommendation systems.