🤖 AI Summary
Current LLM fairness evaluations adopt narrow, task-closed perspectives, failing to uncover latent societal biases. To address this, we propose a systematic governance framework: (1) a hierarchical theoretical model encompassing multidimensional social attributes (e.g., gender, race, geography, education); (2) GFAIR—the first open-generation benchmark for group fairness—featuring a novel “statement organization” evaluation task that explicitly exposes model biases in group-associative reasoning; and (3) GF-THINK, a chain-of-thought debiasing method integrating social-attribute awareness with progressive calibration. Empirical evaluation across mainstream LLMs reveals substantial group-level fairness risks. GF-THINK improves fairness metrics by an average of 32.7%. Both the codebase and the GFAIR dataset are publicly released.
📝 Abstract
Assessing LLMs for bias and fairness is critical, yet current evaluations are often narrow and miss a broad categorical view. In this paper, we propose evaluating the bias and fairness of LLMs through a group fairness lens, using a novel hierarchical schema that characterizes diverse social groups. Specifically, we construct a dataset, GFAIR, encapsulating target-attribute combinations across multiple dimensions. Moreover, we introduce statement organization, a new open-ended text generation task, to uncover complex biases in LLMs. Extensive evaluations of popular LLMs reveal inherent safety concerns. To mitigate the biases of LLMs from a group fairness perspective, we pioneer a novel chain-of-thought method, GF-THINK. Experimental results demonstrate its efficacy in mitigating bias and achieving fairness in LLMs. Our dataset and codes are available at https://github.com/surika/Group-Fairness-LLMs.