🤖 AI Summary
Existing methods for evaluating memorization in representation learning models—such as the DejaVu phenomenon—require training multiple models, rendering them infeasible for large-scale open-source models.
Method: We propose a lightweight, zero-shot, retraining-free evaluation framework that models data-level correlations via single-model feature statistics and integrates a background–foreground memorization estimation protocol inspired by the DejaVu paradigm.
Contribution/Results: Our method enables the first efficient, scalable, and architecture-agnostic quantification of memorization in mainstream open-source vision and multimodal representation models, including CLIP and DINOv2. Extensive experiments demonstrate high consistency across diverse metrics and reveal that large open-source models exhibit significantly lower overall memorization than comparably sized models trained on subsets of the same data. This overcomes the long-standing scalability bottleneck in memorization assessment for large foundation models.
📝 Abstract
Recent research has shown that representation learning models may accidentally memorize their training data. For example, the déjà vu method shows that for certain representation learning models and training images, it is sometimes possible to correctly predict the foreground label given only the representation of the background, better than through dataset-level correlations alone. However, this measurement method requires training two models: one to estimate dataset-level correlations and the other to estimate memorization. This multi-model setup becomes infeasible for large open-source models. In this work, we propose simple alternative methods to estimate dataset-level correlations, and show that these can be used to approximate an off-the-shelf model's memorization ability without any retraining. This enables, for the first time, the measurement of memorization in pre-trained open-source image representation and vision-language representation models. Our results show that different ways of measuring memorization yield very similar aggregate results. We also find that open-source models typically have lower aggregate memorization than similar models trained on a subset of the data. The code is available for both vision and vision-language models.
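To make the déjà vu-style measurement concrete, here is a minimal, hypothetical sketch (not the authors' implementation) of the core test: predict each image's foreground label from the nearest neighbors of its background embedding, and compare that accuracy against the same test run on baseline features that only capture dataset-level correlations. The function names, synthetic data, and the choice of leave-one-out k-NN with cosine-style similarity are illustrative assumptions.

```python
import numpy as np


def knn_label_accuracy(embeddings: np.ndarray, labels: np.ndarray, k: int = 5) -> float:
    """Leave-one-out k-NN: predict each example's foreground label from the
    labels of its nearest neighbors in background-embedding space."""
    # Normalize so inner products behave like cosine similarity.
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = emb @ emb.T
    np.fill_diagonal(sims, -np.inf)  # exclude each point from its own neighbors
    neighbors = np.argsort(-sims, axis=1)[:, :k]
    preds = []
    for row in neighbors:
        vals, counts = np.unique(labels[row], return_counts=True)
        preds.append(vals[np.argmax(counts)])  # majority vote over neighbor labels
    return float(np.mean(np.array(preds) == labels))


def memorization_gap(model_emb: np.ndarray, baseline_emb: np.ndarray,
                     labels: np.ndarray, k: int = 5) -> float:
    """Accuracy of the target model's background embeddings minus the accuracy
    of baseline features that reflect only dataset-level correlations.
    A large positive gap suggests memorization beyond generic correlations."""
    return knn_label_accuracy(model_emb, labels, k) - knn_label_accuracy(baseline_emb, labels, k)


if __name__ == "__main__":
    # Toy demonstration: a "memorizing" model whose background embeddings
    # cluster by foreground label, vs. an uninformative baseline.
    rng = np.random.default_rng(0)
    labels = np.repeat(np.arange(4), 25)                      # 100 images, 4 labels
    model_emb = np.eye(4)[labels] + 0.1 * rng.standard_normal((100, 4))
    baseline_emb = rng.standard_normal((100, 4))              # no label signal
    print(f"memorization gap: {memorization_gap(model_emb, baseline_emb, labels):.2f}")
```

The paper's contribution is to replace the second, retrained "correlation" model with single-model statistics, so only one set of embeddings from the off-the-shelf model is ever computed.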