🤖 AI Summary
Normalizing flows (NFs) suffer from limited semantic representational capacity due to their reliance on log-likelihood optimization, resulting in suboptimal generation quality. To address this, we propose Reverse Representation Alignment (RRA): leveraging the invertibility of NFs, RRA aligns intermediate features of the generative (reverse) pass with the semantic embedding space of a frozen, pre-trained vision foundation model (e.g., CLIP or DINO), enabling unsupervised, model-free semantic alignment during generation. Furthermore, we introduce a training-free, test-time optimization algorithm for classification, which more directly probes the semantic knowledge embedded in the flow. Unlike conventional forward-pass regularization paradigms, RRA operates on the reverse pass, significantly enhancing both semantic expressiveness and generation fidelity. Our method establishes new state-of-the-art performance for NFs on ImageNet at 64×64 and 256×256 resolutions, accelerates training by 3.3×, and simultaneously improves FID scores and classification accuracy.
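The core alignment idea above can be sketched as a simple auxiliary loss: project an intermediate feature from the flow's reverse pass into the foundation model's embedding dimension and maximize cosine similarity with the frozen encoder's embedding. This is a minimal, hypothetical illustration, not the paper's implementation; the projection head `proj_W`, the feature dimensions, and the random stand-ins for flow features and CLIP/DINO embeddings are all assumptions.

```python
import numpy as np

def cosine_align_loss(flow_feats, frozen_feats):
    """Negative mean cosine similarity between per-sample feature rows.

    flow_feats:   (batch, d) projected features from the flow's reverse pass
    frozen_feats: (batch, d) embeddings from a frozen vision encoder
    """
    a = flow_feats / np.linalg.norm(flow_feats, axis=1, keepdims=True)
    b = frozen_feats / np.linalg.norm(frozen_feats, axis=1, keepdims=True)
    return -float(np.mean(np.sum(a * b, axis=1)))

rng = np.random.default_rng(0)
batch, flow_dim, embed_dim = 4, 32, 16

# Hypothetical learnable projection head mapping flow features -> embedding dim.
proj_W = rng.normal(size=(flow_dim, embed_dim)) / np.sqrt(flow_dim)

# Stand-ins: intermediate reverse-pass features and frozen encoder embeddings.
inter_feats = rng.normal(size=(batch, flow_dim))
frozen_embed = rng.normal(size=(batch, embed_dim))

loss = cosine_align_loss(inter_feats @ proj_W, frozen_embed)
```

In training, `loss` would be added to the usual negative log-likelihood objective; perfectly aligned features drive it to its minimum of -1.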
📝 Abstract
Normalizing Flows (NFs) are a class of generative models distinguished by a mathematically invertible architecture, where the forward pass transforms data into a latent space for density estimation, and the reverse pass generates new samples from this space. This characteristic creates an intrinsic synergy between representation learning and data generation. However, the generative quality of standard NFs is limited by poor semantic representations from log-likelihood optimization. To remedy this, we propose a novel alignment strategy that creatively leverages the invertibility of NFs: instead of regularizing the forward pass, we align the intermediate features of the generative (reverse) pass with representations from a powerful vision foundation model, demonstrating superior effectiveness over naive alignment. We also introduce a novel training-free, test-time optimization algorithm for classification, which provides a more intrinsic evaluation of the NF's embedded semantic knowledge. Comprehensive experiments demonstrate that our approach accelerates the training of NFs by over 3.3×, while simultaneously delivering significant improvements in both generative quality and classification accuracy. New state-of-the-art results for NFs are established on ImageNet 64×64 and 256×256. Our code is available at https://github.com/MCG-NJU/FlowBack.
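The abstract's training-free classification idea rests on a general property of flows: because an NF assigns an exact log-likelihood to every input via the change of variables, a class-conditional prior lets one classify by likelihood argmax with no extra training. The sketch below illustrates only this generic mechanism under assumed Gaussian class-conditional latents; the class means, dimensions, and the omission of the flow itself are illustrative assumptions, not the paper's algorithm (for a fixed input, the log-det-Jacobian term is the same for every class, so it cancels in the argmax and is dropped here).

```python
import numpy as np

def class_log_likelihood(z, mu):
    """log N(z; mu, I): Gaussian log-density of latent z under class mean mu.

    In a real flow, z = f(x) and the log-det-Jacobian of f would be added,
    but it is class-independent for a fixed x, so it cannot change the argmax.
    """
    d = z.shape[-1]
    return -0.5 * float(np.sum((z - mu) ** 2)) - 0.5 * d * np.log(2 * np.pi)

def classify(z, class_means):
    """Pick the class whose conditional prior gives z the highest likelihood."""
    scores = [class_log_likelihood(z, mu) for mu in class_means]
    return int(np.argmax(scores))

# Two hypothetical class-conditional latent means in an 8-dim latent space.
class_means = np.stack([np.full(8, -2.0), np.full(8, 2.0)])

z = np.full(8, 1.7)            # latent obtained from the flow's forward pass
pred = classify(z, class_means)  # nearest class mean wins
```

With unit-covariance Gaussians this reduces to nearest-mean classification in latent space, which is why no additional training is required.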