🤖 AI Summary
This work challenges the conventional dichotomy between discriminative and generative models, showing that standard discriminative models such as CLIP implicitly encode rich generative knowledge. To harness this, the authors propose Direct Ascent Synthesis (DAS), a training-free method that inverts CLIP representations by decomposing gradient ascent across spatial resolutions (1×1 to 224×224). Whereas naive single-resolution inversion collapses into adversarial patterns, the multi-scale decomposition steers optimization toward images with natural statistics, including the characteristic $1/f^2$ spectral decay. DAS enables zero-shot text-to-image generation and style transfer without any fine-tuning or adversarial training, achieving image quality approaching that of dedicated generative models while suppressing adversarial artifacts. By dispensing with explicit generative architectures and adversarial objectives, DAS blurs the boundary between discriminative and generative modeling, suggesting that high-fidelity synthesis can emerge directly from off-the-shelf discriminative representations.
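The multi-scale decomposition at the heart of DAS can be sketched with a toy objective. The real method performs gradient ascent on CLIP similarity; the snippet below swaps in a simple image-matching score ($-\tfrac{1}{2}\lVert\text{target}-\text{image}\rVert^2$) so it runs without any model. The helper names `upsample` and `das_sketch`, the scale list, and the learning rate are all illustrative choices, not the paper's implementation.

```python
import numpy as np

def upsample(x, size):
    """Nearest-neighbor upsample a square array to (size, size)."""
    r = size // x.shape[0]
    return np.kron(x, np.ones((r, r)))

def das_sketch(target, scales=(1, 2, 4, 8, 16, 32), steps=300, lr=0.1):
    """Toy multi-resolution gradient ascent: the image is the sum of
    per-scale components, each updated by its own gradient. DAS ascends
    CLIP similarity instead of this stand-in matching objective."""
    size = target.shape[0]
    comps = [np.zeros((s, s)) for s in scales]
    for _ in range(steps):
        img = sum(upsample(c, size) for c in comps)
        resid = target - img  # gradient of the objective w.r.t. the image
        for i, s in enumerate(scales):
            r = size // s
            # Backprop through nearest-neighbor upsampling is a block sum;
            # normalizing to a block mean keeps the step size stable.
            g = resid.reshape(s, r, s, r).sum(axis=(1, 3)) / (r * r)
            comps[i] += lr * g
    return sum(upsample(c, size) for c in comps)
```

Because each scale receives a block-averaged gradient, coarse components capture low-frequency structure while fine components add detail, which is the intuition behind the multi-scale decomposition steering optimization away from high-frequency adversarial noise.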
📝 Abstract
We demonstrate that discriminative models inherently contain powerful generative capabilities, challenging the fundamental distinction between discriminative and generative architectures. Our method, Direct Ascent Synthesis (DAS), reveals these latent capabilities through multi-resolution optimization of CLIP model representations. While traditional inversion attempts produce adversarial patterns, DAS achieves high-quality image synthesis by decomposing optimization across multiple spatial scales (1×1 to 224×224), requiring no additional training. This approach not only enables diverse applications -- from text-to-image generation to style transfer -- but also maintains natural image statistics ($1/f^2$ spectrum) and guides the generation away from non-robust adversarial patterns. Our results demonstrate that standard discriminative models encode substantially richer generative knowledge than previously recognized, providing new perspectives on model interpretability and the relationship between adversarial examples and natural image synthesis.
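The $1/f^2$ power-spectrum claim can be made concrete with a small numerical check: synthesize a random-phase signal with a $1/f$ amplitude spectrum (hence $1/f^2$ power, the statistic natural images approximately follow) and verify that its radially averaged power falls off with slope ≈ −2 in log-log coordinates. The helper names below are illustrative, not from the paper.

```python
import numpy as np

def radial_power_spectrum(img):
    """Average the 2-D power spectrum over annuli of equal radial frequency."""
    n = img.shape[0]
    power = np.abs(np.fft.fftshift(np.fft.fft2(img))) ** 2
    y, x = np.indices((n, n))
    r = np.hypot(x - n // 2, y - n // 2).astype(int)
    sums = np.bincount(r.ravel(), weights=power.ravel())
    counts = np.bincount(r.ravel())
    return sums / np.maximum(counts, 1)

def synth_natural_like(n=256, seed=0):
    """Random-phase image with a 1/f amplitude (~1/f^2 power) spectrum."""
    rng = np.random.default_rng(seed)
    f = np.hypot(np.fft.fftfreq(n)[:, None], np.fft.fftfreq(n)[None, :])
    f[0, 0] = 1.0  # avoid divide-by-zero at the DC bin
    spectrum = (1.0 / f) * np.exp(2j * np.pi * rng.random((n, n)))
    return np.fft.ifft2(spectrum).real

def spectral_slope(img, r_min=2, r_max=None):
    """Log-log slope of radially averaged power vs. radial frequency."""
    p = radial_power_spectrum(img)
    r_max = r_max or img.shape[0] // 4
    radii = np.arange(r_min, r_max)
    return np.polyfit(np.log(radii), np.log(p[radii]), 1)[0]
```

Applying `spectral_slope` to generated images is one way to verify that a synthesis method preserves the natural-image spectral decay rather than injecting the broadband high-frequency energy typical of adversarial patterns.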