🤖 AI Summary
False information detection often suffers from bias due to model reliance on spurious, redundant features—such as URLs, emojis, and named entities—leading to poor generalization. To address this, we propose *interpretability-driven pseudonymization*, a novel paradigm that systematically identifies and removes confounding textual elements via SHAP-based attribution analysis—the first such application in this context. We integrate named entity recognition with controllable pseudonymization during preprocessing to mitigate entity-level bias. Evaluated on BERT-based text classifiers across multiple datasets, our method achieves an average 65.78% improvement in external test performance while maintaining internal accuracy, significantly enhancing model robustness and cross-domain generalizability. Our core contribution lies in establishing a synergistic mechanism between interpretability analysis and data preprocessing, offering a reproducible, scalable pathway for debiasing false information detection systems.
📝 Abstract
The automatic detection of disinformation presents a significant challenge in the field of natural language processing. This task addresses a multifaceted societal and communication issue, which needs approaches that extend beyond the identification of general linguistic patterns through data-driven algorithms. In this research work, we hypothesise that text classification methods are not able to capture the nuances of disinformation and they often ground their decision in superfluous features. Hence, we apply a post-hoc explainability method (SHAP, SHapley Additive exPlanations) to identify spurious elements with high impact on the classification models. Our findings show that non-informative elements (e.g., URLs and emoticons) should be removed and named entities (e.g., Rwanda) should be pseudo-anonymized before training to avoid models' bias and increase their generalization capabilities. We evaluate this methodology with internal dataset and external dataset before and after applying extended data preprocessing and named entity replacement. The results show that our proposal enhances on average the performance of a disinformation classification method with external test data in 65.78% without a significant decrease of the internal test performance.