🤖 AI Summary
Deep learning models often inherit implicit biases, and existing unsupervised debiasing methods rely on latent-space clustering to generate pseudo-labels, which lack semantic interpretability and hinder expert validation. To address this, we propose the first semantics-driven framework for diagnosing implicit bias: it requires neither human annotations nor prior assumptions about bias types. Our approach combines text-guided latent-space disentanglement, task-relevance distillation, bias-semantic mapping, and a generative explanation module to automatically identify and semantically name the non-representative bias features the model has actually learned. This enables explicit, interpretable bias identification and attribution, supporting both in-training intervention and post-hoc verification. Evaluated across multiple benchmarks, our method significantly improves bias detection accuracy and interpretability, demonstrating strong generalizability and practical applicability.
📝 Abstract
In the last few years, the broad applicability of deep learning to downstream tasks and its end-to-end training capabilities have raised growing concerns about potential biases toward specific, non-representative patterns. Many works on unsupervised debiasing leverage the tendency of deep models to learn "easier" samples, for example by clustering the latent space to obtain bias pseudo-labels. However, interpreting such pseudo-labels is not trivial, especially for a non-expert end user, as they provide no semantic information about the bias features. To address this issue, we introduce "Say My Name" (SaMyNa), the first tool to identify biases within deep models semantically. Unlike existing methods, our approach focuses on biases actually learned by the model. Our text-based pipeline enhances explainability and supports debiasing efforts: applicable either during training or for post-hoc validation, our method can disentangle task-related information, positioning itself as a tool for bias analysis. Evaluation on traditional benchmarks demonstrates its effectiveness in detecting biases and even disclaiming them, showcasing its broad applicability for model diagnosis.
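To make the core idea concrete, here is a minimal, purely illustrative sketch of semantic bias naming: a bias direction extracted from a model's latent space is matched against text-concept embeddings in a shared image-text space, and the best-aligned concept becomes the bias's name. All function names, vectors, and concept labels below are hypothetical, not the paper's actual pipeline or API.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def name_bias(bias_direction, concept_embeddings):
    """Return the text concept whose embedding best aligns with the
    (task-disentangled) bias direction, plus all similarity scores."""
    scores = {name: cosine_sim(bias_direction, vec)
              for name, vec in concept_embeddings.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Toy embeddings in a shared image-text space (e.g. CLIP-like);
# in practice these would come from a text encoder over candidate
# bias keywords. Values here are made up for illustration.
concepts = {
    "background color": np.array([0.9, 0.1, 0.0]),
    "texture":          np.array([0.1, 0.9, 0.1]),
    "object shape":     np.array([0.0, 0.1, 0.9]),
}
# A direction the (hypothetically biased) model over-relies on.
bias_dir = np.array([0.85, 0.2, 0.05])

label, scores = name_bias(bias_dir, concepts)
print(label)  # the semantic name assigned to the bias
```

In this toy setup the bias direction aligns most strongly with "background color", so that string is returned as the human-readable bias name; a real system would draw the candidate concepts from a large vocabulary and a learned text encoder.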