WASP: A Weight-Space Approach to Detecting Learned Spuriousness

📅 2024-10-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the problem of fragile generalization in machine learning models caused by spurious correlations. We propose the first spurious correlation detection method operating in weight space—rather than input space or error analysis—enabling earlier and more fundamental bias identification. Our core methodological contribution lies in shifting analytical focus from model outputs to the evolution trajectory of pretrained model weights during fine-tuning, coupled with cross-modal feature disentanglement to uncover implicit biases not explicitly manifested in training or validation sets. The approach is applicable to multimodal (image-text) settings and successfully identifies previously unknown spurious patterns on ImageNet-1k. Experiments across multiple benchmarks demonstrate high-precision localization of unseen spurious correlations, empirically validating that weight-space dynamics encode bias signals earlier and at a deeper level than observable prediction behavior.
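The core intuition, analyzing how a model's weights drift toward candidate concept directions during fine-tuning rather than inspecting its predictions, can be illustrated with a toy sketch. This is not the paper's actual WASP pipeline: the concept directions, synthetic "duck vs. not-duck" data, and linear-head training below are all illustrative assumptions (in practice the concept vectors would come from a pretrained text encoder in a shared image-text embedding space).

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # embedding dimension

# Hypothetical concept directions in a shared embedding space
# (here random unit vectors; in practice, text-encoder embeddings).
concepts = {name: rng.standard_normal(d) for name in ["water", "sky", "beak"]}
for name in concepts:
    concepts[name] /= np.linalg.norm(concepts[name])

# Synthetic positives for class "duck": features carry both the
# spurious "water" direction and the causal "beak" direction.
n = 500
X_pos = rng.standard_normal((n, d)) + 2.0 * concepts["water"] + concepts["beak"]
X_neg = rng.standard_normal((n, d)) - concepts["beak"]
X = np.vstack([X_pos, X_neg])
y = np.concatenate([np.ones(n), -np.ones(n)])

# "Fine-tune" a linear head (hinge loss, plain gradient descent)
# and track how its weights drift away from the initialization.
w0 = np.zeros(d)
w = w0.copy()
lr = 0.1
for _ in range(200):
    margins = y * (X @ w)
    viol = (margins < 1)[:, None]  # hinge-loss subgradient mask
    grad = -(y[:, None] * X * viol).mean(0) + 1e-3 * w
    w -= lr * grad

drift = w - w0
drift /= np.linalg.norm(drift)

# Rank candidate concepts by alignment with the weight drift:
# concepts the fine-tuned head relies on score high, whether or not
# they are causally related to the class.
scores = {name: float(drift @ v) for name, v in concepts.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)  # "water" and "beak" should outrank the irrelevant "sky"
```

The spurious "water" direction surfaces in the drift alongside the causal "beak" direction even though no counterexample (a duck away from water) appears in the data, which is the kind of signal prediction- or error-based analyses would miss.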

📝 Abstract
It is of crucial importance to train machine learning models such that they clearly understand what defines each class in a given task. Though there is a body of work dedicated to identifying the spurious correlations featured by a dataset that may impact a model's understanding of the classes, all current approaches rely solely on data or error analysis. That is, they cannot point out spurious correlations learned by the model that are not already exposed by counterexamples in the validation or training sets. We propose a method that transcends this limitation, switching the focus from analyzing a model's predictions to analyzing the model's weights, the mechanism behind its decisions, which proves to be more insightful. Our proposed Weight-space Approach to detecting Spuriousness (WASP) relies on analyzing the weights of foundation models as they drift towards capturing various (spurious) correlations while being fine-tuned on a given dataset. We demonstrate that, unlike previous works, our method (i) can expose spurious correlations featured by a dataset even when they are not exposed by training or validation counterexamples, (ii) works for multiple modalities such as image and text, and (iii) can uncover previously untapped spurious correlations learned by ImageNet-1k classifiers.
Problem

Research questions and friction points this paper is trying to address.

Spurious correlations cause fragile generalization in trained models
Existing detectors rely solely on data or error analysis
Biases without counterexamples in training or validation sets go undetected
Innovation

Methods, ideas, or system contributions that make the work stand out.

Analyzes weight drift of foundation models during fine-tuning
Exposes spurious correlations lacking dataset counterexamples
Works across image and text modalities
Cristian Daniel Păduraru
Bitdefender, Romania; University of Bucharest
Antonio Bărbălau
Bitdefender, Romania; University of Bucharest
Radu Filipescu
Bitdefender, Romania; University of Bucharest
Andrei Liviu Nicolicioiu
Mila and University of Montreal
Deep Learning, Robustness, Out-of-distribution Generalisation, Graph Neural Networks
Elena Burceanu
Bitdefender
Unsupervised Video Understanding, Tracking, Segmentation, Machine Learning, Distribution Shift