🤖 AI Summary
This work addresses the unfair judgments that large language models exhibit in toxicity assessment across demographic groups, particularly for nuanced expressions such as implicit hate speech. To mitigate this issue without modifying model parameters, the authors propose FairToT, a framework that uses prompt engineering at inference time to dynamically identify potentially biased predictions and apply interpretable fairness indicators for real-time intervention. This approach enables on-the-fly fairness optimization without model retraining. Experimental results show that FairToT significantly reduces performance disparities across demographic groups on multiple benchmark datasets while preserving the accuracy and stability of toxicity predictions.
📝 Abstract
Large Language Models (LLMs) are increasingly used for toxicity assessment in online moderation systems, where fairness across demographic groups is essential for equitable treatment. However, LLMs often produce inconsistent toxicity judgments for subtle expressions, particularly those involving implicit hate speech, revealing underlying biases that are difficult to correct through standard training. This raises a key question that existing approaches often overlook: when should corrective mechanisms be invoked to ensure fair and reliable assessments? To address this, we propose FairToT, an inference-time framework that enhances LLM fairness through prompt-guided toxicity assessment. FairToT identifies cases where demographic-related variation is likely to occur and determines when additional assessment should be applied. In addition, we introduce two interpretable fairness indicators that detect such cases and improve inference consistency without modifying model parameters. Experiments on benchmark datasets show that FairToT reduces group-level disparities while maintaining stable and reliable toxicity predictions, demonstrating that inference-time refinement offers an effective and practical approach for fairness improvement in LLM-based toxicity assessment systems. The source code can be found at https://aisuko.github.io/fair-tot/.
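To make the inference-time workflow concrete, the sketch below illustrates the general pattern the abstract describes: score demographic variants of the same input, compute a simple interpretable disparity indicator, and trigger a corrective re-assessment only when the indicator exceeds a threshold. This is a minimal illustration, not the paper's method: the `toxicity_score` stub, the max-gap indicator, the mean-score fallback, and the threshold value are all hypothetical assumptions standing in for FairToT's actual prompts and indicators.

```python
def toxicity_score(text: str) -> float:
    """Stand-in for an LLM toxicity judgment in [0, 1] (stub for demo)."""
    return 0.8 if "group_a" in text else 0.3

def group_variants(template: str, groups: list[str]) -> list[str]:
    """Instantiate the same message once per demographic group."""
    return [template.format(group=g) for g in groups]

def disparity_indicator(scores: list[float]) -> float:
    """An interpretable indicator: the largest score gap across variants.
    (Illustrative choice; the paper's two indicators are not specified here.)"""
    return max(scores) - min(scores)

def assess(template: str, groups: list[str], threshold: float = 0.2) -> dict:
    """Score each variant; if the disparity indicator fires, intervene."""
    variants = group_variants(template, groups)
    scores = [toxicity_score(v) for v in variants]
    gap = disparity_indicator(scores)
    intervened = gap > threshold
    if intervened:
        # Hypothetical intervention: re-assess with a group-neutral score
        # (here, the mean across variants) so no group is judged more harshly.
        fair = sum(scores) / len(scores)
        scores = [fair] * len(scores)
    return {"scores": scores, "disparity": gap, "intervened": intervened}

result = assess("people from {group} are ruining everything",
                ["group_a", "group_b"])
```

In this toy run the two variants receive gap 0.5 > 0.2, so the intervention fires and both groups end up with the same score; the key property the abstract emphasizes is that all of this happens at inference time, with no change to model parameters.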