🤖 AI Summary
Black-box binary classifiers such as BERT lack interpretability, hindering bias auditing and trustworthy deployment. Method: This paper proposes a decision-tree extraction framework grounded in PAC (Probably Approximately Correct) learning theory, the first to systematically integrate PAC fidelity guarantees into surrogate model construction. It extends ID3/CART by incorporating text feature discretization and sensitive-attribute analysis to generate statistically verifiable, faithful decision trees. Contribution/Results: Applied to occupational gender bias detection in BERT, the method extracts PAC-guaranteed decision trees across multiple BERT variants, enabling interpretable localization of bias patterns, fine-grained attribution, and visual diagnostic analysis that accurately identifies high-bias occupations. The approach significantly enhances transparency and credibility in model bias assessment, providing formal guarantees on how closely the surrogate tree approximates the behaviour of the original black-box classifier.
📝 Abstract
Decision trees are a popular machine learning method, valued for their inherent explainability. In Explainable AI, decision trees serve as surrogate models for complex black-box AI models or as approximations of parts of such models. A key challenge of this approach is assessing how accurately the extracted decision tree represents the original model and determining the extent to which it can be trusted as an approximation of its behaviour. In this work, we investigate the use of the Probably Approximately Correct (PAC) framework to provide a theoretical guarantee of fidelity for decision trees extracted from AI models. Leveraging the theoretical foundations of the PAC framework, we adapt a decision tree algorithm to ensure a PAC guarantee under specific conditions. We focus on binary classification and conduct experiments where we extract decision trees from BERT-based language models with PAC guarantees. Our results indicate occupational gender bias in these models, confirming previous results in the literature. Additionally, the decision tree format enhances the visualization of which occupations are most impacted by social bias.
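To make the PAC fidelity idea concrete, here is a minimal sketch of how such a guarantee is typically certified for a surrogate: draw enough i.i.d. inputs, label them with the black box, and bound the gap between empirical and true fidelity via Hoeffding's inequality. The functions `black_box` and `surrogate` below are hypothetical stand-ins (not the paper's BERT models or its extraction algorithm), and the sample-size formula is the standard two-sided Hoeffding bound, shown only to illustrate the (ε, δ) mechanics.

```python
import math
import random

def pac_sample_size(epsilon: float, delta: float) -> int:
    """Hoeffding bound: with n >= ln(2/delta) / (2 * epsilon^2) samples,
    the empirical fidelity of the surrogate lies within epsilon of its
    true fidelity with probability at least 1 - delta."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * epsilon ** 2))

def black_box(x: float) -> int:
    # Hypothetical stand-in for a black-box binary classifier (e.g. BERT-based).
    return 1 if math.sin(3.0 * x) > 0.1 else 0

def surrogate(x: float) -> int:
    # Hypothetical stand-in for an extracted decision-tree rule
    # (a single threshold split, for illustration only).
    return 1 if math.sin(3.0 * x) > 0.0 else 0

def empirical_fidelity(epsilon: float = 0.05, delta: float = 0.05,
                       seed: int = 0) -> float:
    """Estimate fidelity (agreement rate) on a PAC-sized sample."""
    rng = random.Random(seed)
    n = pac_sample_size(epsilon, delta)
    agree = sum(
        black_box(x) == surrogate(x)
        for x in (rng.uniform(-3.0, 3.0) for _ in range(n))
    )
    return agree / n

if __name__ == "__main__":
    print("samples needed:", pac_sample_size(0.05, 0.05))  # 738
    print("empirical fidelity: %.3f" % empirical_fidelity())
```

With ε = δ = 0.05, the bound asks for 738 samples; the reported agreement rate is then, with probability at least 0.95, within 0.05 of the surrogate's true fidelity to the black box.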