🤖 AI Summary
Large language models (LLMs) are prone to bias inherited from instruction-tuning datasets, degrading their generalization performance. Existing debiasing methods rely heavily on human priors or in-context learning, limiting adaptability across diverse bias types. This paper proposes the first autonomous debiasing framework integrating information theory and causal inference: it quantifies bias impact via information gain, constructs a structural causal model (SCM), and performs counterfactual intervention to automatically reweight the data distribution prior to standard supervised fine-tuning. The method requires no manual annotations, external knowledge bases, or predefined bias assumptions, enabling adaptive correction of heterogeneous biases. Evaluated on multiple benchmarks, our approach significantly improves model generalization and reduces bias metrics by an average of 32.7%.
📝 Abstract
Despite significant progress, recent studies indicate that current large language models (LLMs) may still capture dataset biases and exploit them during inference, leading to poor generalizability. However, because dataset biases are diverse and in-context-learning-based bias suppression is insufficient, the effectiveness of previous prior-knowledge-based debiasing methods and automatic debiasing methods based on in-context learning is limited. To address these challenges, we combine causal mechanisms with information theory and propose an information gain-guided causal intervention debiasing (IGCIDB) framework. The framework first applies an information gain-guided causal intervention to automatically and autonomously balance the distribution of the instruction-tuning dataset; it then trains LLMs on the debiased dataset with standard supervised fine-tuning. Experimental results show that IGCIDB effectively debiases LLMs and improves their generalizability across different tasks.
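The core idea of quantifying a bias feature's influence via information gain and then rebalancing the data can be illustrated with a minimal sketch. This is not the paper's implementation: the toy "bias feature" (a surface cue correlated with the label) and the intervention (reweighting each example toward the independent joint `p(f)·p(y)`, so the cue carries no information about the label) are illustrative assumptions.

```python
import math
from collections import Counter

def entropy(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_gain(pairs, weights=None):
    """I(feature; label) = H(label) - H(label | feature),
    estimated from (optionally weighted) co-occurrence counts."""
    if weights is None:
        weights = [1.0] * len(pairs)
    total = sum(weights)
    joint, feat, lab = Counter(), Counter(), Counter()
    for (f, y), w in zip(pairs, weights):
        joint[(f, y)] += w
        feat[f] += w
        lab[y] += w
    h_label = entropy([c / total for c in lab.values()])
    h_cond = sum(
        (feat[f] / total)
        * entropy([joint[(f, y)] / feat[f] for y in lab if (f, y) in joint])
        for f in feat
    )
    return h_label - h_cond

def debias_weights(pairs):
    """Reweight examples so the bias feature becomes independent of the
    label: target joint p(f, y) = p(f) * p(y), i.e. zero information gain."""
    n = len(pairs)
    joint = Counter(pairs)
    feat = Counter(f for f, _ in pairs)
    lab = Counter(y for _, y in pairs)
    return [(feat[f] / n) * (lab[y] / n) / (joint[(f, y)] / n)
            for f, y in pairs]

# Toy dataset where a surface cue ("neg" wording) correlates with label 0:
data = [("neg", 0)] * 8 + [("pos", 1)] * 8 + [("neg", 1)] * 2 + [("pos", 0)] * 2
print(round(information_gain(data), 3))                      # → 0.278 (biased)
print(round(information_gain(data, debias_weights(data)), 3))  # → 0.0
```

A training set rebalanced this way would then go through ordinary supervised fine-tuning; the paper's actual intervention additionally reasons over a structural causal model rather than a single hand-picked feature.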