🤖 AI Summary
Existing error diagnosis methods for machine learning models struggle to identify semantically coherent error patterns and to explain the underlying sources of bias. They also rely heavily on manual annotations and predefined attributes, which limits their applicability in specialized domains such as medical imaging. This paper proposes a language-driven error diagnosis paradigm that (1) uses large language models (LLMs) to derive testable hypotheses directly from text, so that error slices are discovered without clustering, attribute annotations, or prior knowledge of bias, and (2) generates pseudo attributes from those hypotheses to mitigate errors across all discovered biases rather than only the worst-performing group. Evaluated on six datasets spanning natural and medical images, with more than 200 classifiers covering diverse architectures, pretraining strategies, and LLMs, the method consistently outperforms existing baselines in discovering and mitigating biases.
📝 Abstract
Error slice discovery is crucial for diagnosing and mitigating model errors. Current slice discovery methods, based on clustering or discrete attributes, face key limitations: (1) clustering results in incoherent slices, while assigning discrete attributes to slices leads to incomplete coverage of error patterns due to missing or insufficient attributes; (2) these methods lack complex reasoning, preventing them from fully explaining model biases; (3) they fail to integrate domain knowledge, limiting their usage in specialized fields, e.g., radiology. We propose LADDER (Language-Driven Discovery and Error Rectification) to address these limitations by (1) leveraging the flexibility of natural language to address incompleteness, and (2) employing LLMs' latent domain knowledge and advanced reasoning to analyze sentences and derive testable hypotheses directly, identifying biased attributes and forming coherent error slices without clustering. Existing mitigation methods typically address only the worst-performing group, often amplifying errors in other subgroups. In contrast, LADDER generates pseudo attributes from the discovered hypotheses to mitigate errors across all biases without explicit attribute annotations or prior knowledge of bias. Rigorous evaluations on 6 datasets spanning natural and medical images, comparing 200+ classifiers with diverse architectures, pretraining strategies, and LLMs, show that LADDER consistently outperforms existing baselines in discovering and mitigating biases.
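To make the idea of a testable hypothesis concrete, the following is a minimal sketch, not the paper's implementation. It assumes each sample comes with a text description and a flag for whether the classifier was correct, and it uses plain keyword matching as a stand-in for the LLM-driven steps described above. The `Sample` type, the `slice_accuracy_gap` helper, and the toy data are all hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    caption: str   # text describing the image (hypothetical field)
    correct: bool  # whether the classifier predicted this sample correctly


def slice_accuracy_gap(samples, keyword):
    """Test the hypothesis "the model errs when `keyword` is present".

    Returns (accuracy with keyword, accuracy without keyword, gap).
    Keyword matching is an illustrative assumption, not the paper's method.
    """
    key = keyword.lower()
    with_attr = [s.correct for s in samples if key in s.caption.lower()]
    without_attr = [s.correct for s in samples if key not in s.caption.lower()]
    if not with_attr or not without_attr:
        raise ValueError("keyword must split the samples into two non-empty groups")
    acc_with = sum(with_attr) / len(with_attr)
    acc_without = sum(without_attr) / len(without_attr)
    return acc_with, acc_without, acc_without - acc_with


# Toy data, invented for this example only.
samples = [
    Sample("a bird flying over water", True),
    Sample("a bird standing on water", True),
    Sample("a bird perched in a bamboo forest", False),
    Sample("a bird near water with reeds", True),
    Sample("a bird on a forest branch", False),
    Sample("a bird in the forest", True),
]

acc_with, acc_without, gap = slice_accuracy_gap(samples, "forest")
print(f"with: {acc_with:.2f}  without: {acc_without:.2f}  gap: {gap:.2f}")
```

A large positive gap suggests that the attribute marks a candidate error slice. In this sketch, the membership flag (keyword present or absent) plays the role of a pseudo attribute that a downstream mitigation step could use in place of annotated group labels.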