No Caption, No Problem: Caption-Free Membership Inference via Model-Fitted Embeddings

📅 2026-02-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitation of existing membership inference attacks, which rely on ground-truth textual descriptions and are thus inapplicable in realistic image-only scenarios. The authors propose MoFit, the first framework capable of performing membership inference without any real or generated text prompts. MoFit employs a two-stage strategy—model-fitting proxy optimization followed by proxy-driven embedding extraction—to construct synthetic conditional inputs that overfit the generative manifold of the target diffusion model, thereby amplifying the discrepancy in conditional loss responses between member and non-member samples. Extensive experiments demonstrate that MoFit significantly outperforms vision-language-model-based baselines across multiple datasets and diffusion architectures, achieving performance comparable to state-of-the-art methods that require access to ground-truth text descriptions.

📝 Abstract
Latent diffusion models have achieved remarkable success in high-fidelity text-to-image generation, but their tendency to memorize training data raises critical privacy and intellectual property concerns. Membership inference attacks (MIAs) provide a principled way to audit such memorization by determining whether a given sample was included in training. However, existing approaches assume access to ground-truth captions. This assumption fails in realistic scenarios where only images are available and their textual annotations remain undisclosed, and prior methods become ineffective when ground-truth captions are substituted with vision-language model (VLM) captions. In this work, we propose MoFit, a caption-free MIA framework that constructs synthetic conditioning inputs explicitly overfitted to the target model's generative manifold. Given a query image, MoFit proceeds in two stages: (i) model-fitted surrogate optimization, where a perturbation applied to the image is optimized so that the resulting surrogate lies in regions of the model's unconditional prior learned from member samples, and (ii) surrogate-driven embedding extraction, where a model-fitted embedding is derived from the surrogate and then used as a mismatched condition for the query image. This embedding amplifies conditional loss responses for member samples while leaving hold-outs relatively less affected, thereby enhancing separability in the absence of ground-truth captions. Our comprehensive experiments across multiple datasets and diffusion models demonstrate that MoFit consistently outperforms prior VLM-conditioned baselines and achieves performance competitive with caption-dependent methods.
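The two-stage pipeline described above can be illustrated with a deliberately simplified numpy sketch. Here the target diffusion model is replaced by a toy linear "manifold" (the column space of a fixed matrix `W`), the unconditional prior loss by distance to that manifold, and the conditional loss by reconstruction error from a conditioning embedding. All function names (`prior_loss`, `cond_loss`, `mofit_score`) and the linear stand-in are illustrative assumptions, not the paper's actual implementation; the sketch only shows the scoring logic: fit a surrogate to the model's prior, extract an embedding from it, and score the query by its conditional loss under that mismatched condition.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the target model: its "generative manifold" is the
# column space of W. All names here are hypothetical, for illustration.
D, K = 16, 4
W = rng.normal(size=(D, K))  # fixed "learned" weights of the toy model

def cond_loss(x, e):
    """Conditional loss of image x under conditioning embedding e."""
    return float(np.sum((x - W @ e) ** 2))

def prior_loss_grad(s):
    """Gradient of the toy unconditional prior loss ||(I - P) s||^2,
    where P projects onto the model's manifold (column space of W)."""
    e_star, *_ = np.linalg.lstsq(W, s, rcond=None)
    return 2.0 * (s - W @ e_star)

def mofit_score(x, steps=200, lr=0.05):
    """Caption-free membership score (sketch of the two-stage idea).

    Stage 1: optimize a perturbation delta so the surrogate x + delta
    sits in regions favored by the model's unconditional prior.
    Stage 2: extract a model-fitted embedding from the surrogate and
    use it as a mismatched condition for the original query image.
    """
    delta = np.zeros_like(x)
    for _ in range(steps):
        delta -= lr * prior_loss_grad(x + delta)  # gradient descent on the prior
    surrogate = x + delta
    e_fit, *_ = np.linalg.lstsq(W, surrogate, rcond=None)  # embedding extraction
    return cond_loss(x, e_fit)  # in this toy: low for members, higher for hold-outs

# Members lie on the toy model's manifold; hold-outs are generic images.
member = W @ rng.normal(size=K)
holdout = rng.normal(size=D)
print(mofit_score(member) < mofit_score(holdout))
```

In this toy, a member image already lies on the manifold, so the surrogate stays put and the fitted embedding reconstructs it almost exactly, while a hold-out keeps an irreducible residual; the gap between the two scores is the separability signal the abstract describes.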
Problem

Research questions and friction points this paper is trying to address.

membership inference
latent diffusion models
caption-free
privacy
model memorization
Innovation

Methods, ideas, or system contributions that make the work stand out.

caption-free
membership inference
latent diffusion models
model-fitted embedding
surrogate optimization
Joonsung Jeon
Korea Advanced Institute of Science and Technology (KAIST)
Woo Jae Kim
Korea Advanced Institute of Science and Technology (KAIST)
Suhyeon Ha
Korea Advanced Institute of Science and Technology (KAIST)
Sooel Son
KAIST
Web Security · Privacy · Program Analysis
Sung-Eui Yoon
Professor, Dept. of Computer Science, KAIST
Graphics · Vision · Robotics