🤖 AI Summary
This work addresses the propagation of social biases from training data into vision-language models (VLMs) and the lack of utility guarantees in existing debiasing methods. The authors propose a closed-form, training- and annotation-free solution that jointly debiases visual and textual representations in a cross-modal space, achieving Pareto-optimal fairness with a provable upper bound on utility loss, the first such guarantee in this domain. By combining closed-form optimization, cross-modal alignment, and fairness constraints, the method supports both group and intersectional fairness across modalities and tasks. It consistently outperforms existing approaches on zero-shot image classification, text-to-image retrieval, and text-to-image generation, yielding significant improvements across multiple fairness metrics while maintaining strong task performance.
📝 Abstract
While Vision-Language Models (VLMs) have achieved remarkable performance across diverse downstream tasks, recent studies have shown that they can inherit social biases from training data and propagate them into downstream applications. To address this issue, various debiasing approaches have been proposed, yet most aim to improve fairness without any theoretical guarantee that model utility is preserved. In this paper, we introduce a debiasing method that yields a **closed-form** solution in the cross-modal space, achieving Pareto-optimal fairness with a **bounded utility loss**. Our method is **training-free**, requires **no annotated data**, and jointly debiases both visual and textual modalities across downstream tasks. Extensive experiments show that our method outperforms existing methods across diverse fairness metrics and datasets, for both group and **intersectional** fairness, on downstream tasks such as zero-shot image classification, text-to-image retrieval, and text-to-image generation, while preserving task performance.
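To make the setting concrete, below is a minimal, hypothetical sketch of one classic way a training-free, closed-form debiasing step can act on a shared image-text embedding space: estimating a bias subspace from attribute prompt embeddings and projecting it out of both modalities. All function names, the example prompts, and the 512-dimensional space are assumptions for illustration; this is not the paper's actual formulation, which additionally enforces fairness constraints with a provable bound on utility loss.

```python
# Generic illustration of training-free, closed-form debiasing by
# orthogonal projection in a shared image-text embedding space.
# NOT the paper's method: the bias-subspace estimate, the projection
# step, and all names here are assumptions.
import numpy as np

def bias_subspace(attr_embeds: np.ndarray, k: int = 1) -> np.ndarray:
    """Estimate a k-dim bias subspace from embeddings of attribute
    prompts (e.g., "a photo of a man" / "a photo of a woman")."""
    centered = attr_embeds - attr_embeds.mean(axis=0, keepdims=True)
    # Top-k right singular vectors span the dominant bias directions.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k]  # shape (k, d), rows orthonormal

def debias(embeds: np.ndarray, basis: np.ndarray) -> np.ndarray:
    """Closed-form step: remove the bias-subspace component from
    image or text embeddings, then re-normalize."""
    proj = embeds @ basis.T @ basis  # component inside the bias subspace
    out = embeds - proj
    return out / np.linalg.norm(out, axis=-1, keepdims=True)

# Usage with placeholder embeddings: the identical projection is applied
# to both modalities, so image-text similarities remain comparable.
rng = np.random.default_rng(0)
attr = rng.normal(size=(8, 512))   # embeddings of attribute prompts
imgs = rng.normal(size=(4, 512))   # image embeddings
txts = rng.normal(size=(4, 512))   # text embeddings
B = bias_subspace(attr, k=2)
imgs_d, txts_d = debias(imgs, B), debias(txts, B)
```

Because the same linear projection is applied to visual and textual embeddings, cross-modal alignment is preserved after debiasing, which is why jointly treating both modalities matters for retrieval- and generation-style tasks.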