🤖 AI Summary
This work addresses a critical gap in existing concept erasure methods for text-to-video (T2V) diffusion models, which only verify the absence of target concepts in output frames without assessing whether internal representations are truly removed. To this end, the authors propose PROBE, a diagnostic protocol that quantifies the reactivation potential of erased concepts by optimizing lightweight pseudo-token embeddings under latent alignment constraints while keeping all model parameters frozen. They introduce a multi-level evaluation framework—integrating classifier-based detection, semantic similarity metrics, temporal reactivation analysis, and human validation—and reveal, for the first time, that current approaches achieve only output-level suppression rather than representational removal. Experiments across three T2V architectures, three concept categories, and three erasure strategies demonstrate measurable residual concept capacity in all cases, with robustness strongly correlated to the depth of temporal intervention.
📝 Abstract
Concept erasure techniques for text-to-video (T2V) diffusion models report substantial suppression of sensitive content, yet current evaluation is limited to checking whether the target concept is absent from generated frames, treating output-level suppression as evidence of representational removal. We introduce PROBE, a diagnostic protocol that quantifies the *reactivation potential* of erased concepts in T2V models. With all model parameters frozen, PROBE optimizes a lightweight pseudo-token embedding through a denoising reconstruction objective combined with a novel latent alignment constraint that anchors recovery to the spatiotemporal structure of the original concept. We make three contributions: (1) a multi-level evaluation framework spanning classifier-based detection, semantic similarity, temporal reactivation analysis, and human validation; (2) systematic experiments across three T2V architectures, three concept categories, and three erasure strategies revealing that all tested methods leave measurable residual capacity whose robustness correlates with intervention depth; and (3) the identification of temporal re-emergence, a video-specific failure mode where suppressed concepts progressively resurface across frames, invisible to frame-level metrics. These findings suggest that current erasure methods achieve output-level suppression rather than representational removal. We release our protocol to support reproducible safety auditing. Our code is available at https://github.com/YiweiXie/PRObingBasedEvaluation.
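To make the probing idea concrete, the core loop can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: the frozen denoiser is replaced by a fixed random linear surrogate, the spatiotemporal latent anchor is a random vector, and the embedding dimension, learning rate, and the `lambda_align` weight are all assumed values chosen for the sketch. The only point it demonstrates is the structure of the objective: gradients flow solely into a pseudo-token embedding while every model parameter stays frozen, and the loss combines a denoising reconstruction term with a latent alignment penalty.

```python
import torch

torch.manual_seed(0)
D = 16  # toy embedding dimension (assumption)

# Frozen surrogate for the erased T2V denoiser: no grad ever reaches W.
W = torch.randn(D, D)
target_latent = torch.randn(D)  # stand-in for the denoising target
anchor = torch.randn(D)         # stand-in for the concept's latent anchor

def denoise(e: torch.Tensor) -> torch.Tensor:
    # Surrogate denoiser applied to the pseudo-token embedding.
    return torch.tanh(e @ W)

def probe_loss(e: torch.Tensor, lambda_align: float = 0.1) -> torch.Tensor:
    # Denoising reconstruction term + latent alignment constraint.
    rec = ((denoise(e) - target_latent) ** 2).mean()
    align = ((e - anchor) ** 2).mean()
    return rec + lambda_align * align

# Only the lightweight pseudo-token embedding is optimized.
pseudo_token = torch.zeros(D, requires_grad=True)
opt = torch.optim.Adam([pseudo_token], lr=0.05)

initial = probe_loss(pseudo_token).item()
for _ in range(200):
    opt.zero_grad()
    loss = probe_loss(pseudo_token)
    loss.backward()
    opt.step()
final = probe_loss(pseudo_token).item()
```

If `final` drops well below `initial`, the frozen surrogate still admits an input that reconstructs the target, which is the kind of residual capacity the protocol is designed to surface; in the real setting the surrogate is the erased T2V model itself.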