AI Summary
This work addresses the performance bottlenecks of compact normalizing flow (NF) models in density estimation and sample quality. We propose a novel knowledge distillation framework specifically designed for NF architectures, moving beyond conventional output-layer distillation to enable asymmetric, structure-aware knowledge transfer at intermediate latent layers, particularly suited to the modular design of compositional NFs. By explicitly modeling probabilistic flow mappings between corresponding teacher and student layers, our method significantly improves parameter efficiency and inference speed of student models. Experiments demonstrate that distilled compact NFs achieve 23–37% lower density estimation error, 18–41% improvement in sampling Fréchet Inception Distance (FID), 2.1× higher throughput, and 58% reduction in computational overhead on standard benchmarks. The approach establishes a scalable paradigm for lightweight generative modeling.
Abstract
Explicit density learners are becoming an increasingly popular technique for generative models because of their ability to better model probability distributions. They have advantages over Generative Adversarial Networks because they can perform density estimation and exact latent-variable inference. These properties make it straightforward to interpolate between samples, calculate sample likelihoods, and analyze the learned probability distribution. The downside of these models is that they are often more difficult to train and have lower sampling quality.
Normalizing flows are explicit density models that use composable bijective functions to turn an intractable probability function into a tractable one. In this work, we present novel knowledge distillation techniques to increase the sampling quality and density estimation of smaller student normalizing flows. We study the capacity of knowledge distillation in Compositional Normalizing Flows to understand the benefits and weaknesses provided by these architectures. Normalizing flows have unique properties that allow for non-traditional forms of knowledge transfer, in which knowledge is transferred within intermediate layers. We find that through this distillation, we can make students significantly smaller while making substantial performance gains over a non-distilled student. Smaller models also have proportionally higher throughput, since throughput depends on the number of bijectors, and thus parameters, in the network.
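To make the idea of intermediate-layer transfer concrete, the following is a minimal sketch and not the method of this work. It assumes one-dimensional affine bijectors, a standard-normal base distribution, and a squared-error loss between hand-paired teacher and student latents. The names `Affine`, `flow_log_prob`, and `latent_distillation_loss`, as well as the layer pairing, are hypothetical and chosen only for illustration.

```python
import math


class Affine:
    """Hypothetical 1-D bijector y = scale * x + shift, with scale != 0."""

    def __init__(self, scale, shift):
        self.scale = scale
        self.shift = shift

    def forward(self, x):
        # Return the transformed value and log|dy/dx| for the change of variables.
        return self.scale * x + self.shift, math.log(abs(self.scale))


def flow_log_prob(bijectors, x):
    """Exact log p(x) under a standard-normal base, plus every intermediate latent."""
    z, log_det, latents = x, 0.0, []
    for bijector in bijectors:
        z, layer_log_det = bijector.forward(z)
        log_det += layer_log_det
        latents.append(z)
    base_log_prob = -0.5 * (z * z + math.log(2.0 * math.pi))
    return base_log_prob + log_det, latents


def latent_distillation_loss(teacher, student, x, pairs):
    """Mean squared distance between paired (teacher_idx, student_idx) latents.

    Assumption: squared error is used here purely for illustration.
    """
    _, teacher_latents = flow_log_prob(teacher, x)
    _, student_latents = flow_log_prob(student, x)
    total = sum((teacher_latents[i] - student_latents[j]) ** 2 for i, j in pairs)
    return total / len(pairs)


# A four-bijector teacher and a two-bijector student. Each student layer
# equals the composition of two consecutive teacher layers.
teacher = [Affine(2.0, 0.0), Affine(0.5, 1.0), Affine(3.0, 0.0), Affine(1.0, -1.0)]
student = [Affine(1.0, 1.0), Affine(3.0, -1.0)]

x = 0.7
teacher_lp, _ = flow_log_prob(teacher, x)
student_lp, _ = flow_log_prob(student, x)
loss = latent_distillation_loss(teacher, student, x, pairs=[(1, 0), (3, 1)])
```

In this toy setup the student reproduces the teacher's latents at the paired layers, so the loss vanishes and both flows assign the same log-likelihood to `x` while the student evaluates half as many bijectors. With learned, non-affine bijectors an exact match is generally not available, and a loss of this kind would be minimized by training rather than satisfied by construction.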