Distilling Latent Manifolds: Resolution Extrapolation by Variational Autoencoders

📅 2026-03-15
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work challenges the conventional belief that variational autoencoder (VAE) encoders trained exclusively at low resolution (e.g., 256²) cannot generalize to high-resolution (e.g., 512²) image reconstruction. The authors identify and validate a counterintuitive phenomenon: compact student encoders obtained via knowledge distillation, despite being trained solely on low-resolution data, achieve superior reconstruction performance on unseen high-resolution inputs. By upsampling inputs before encoding and downsampling reconstructions for evaluation, the proposed approach yields significant gains in PSNR, SSIM, LPIPS, and rFID on ImageNet-256. The study demonstrates that distilled VAE models can generalize across resolutions without any high-resolution training, effectively inheriting the teacher model's high-resolution representational capabilities and revising established assumptions about out-of-distribution generalization in generative modeling.
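The remapping itself is simple enough to express in a few lines. Below is a minimal PyTorch-style sketch of the idea, assuming a student VAE that exposes `encode`/`decode` methods returning tensors; the names `student_vae` and `reconstruct_with_remapping`, and the choice of bilinear interpolation, are illustrative assumptions rather than the authors' released code.

```python
# Minimal sketch of the resolution-remapping trick described above.
# Assumes a PyTorch-style VAE whose encode()/decode() take and return
# tensors; `student_vae` and the bilinear mode are assumptions, not the
# paper's exact API.
import torch
import torch.nn.functional as F

@torch.no_grad()
def reconstruct_with_remapping(student_vae, images_256, hi_res=512):
    """Encode 256x256 images at an unseen higher resolution, then
    downsample the reconstruction back to 256x256 for evaluation."""
    # Upsample inputs before encoding (256^2 -> 512^2).
    hi = F.interpolate(images_256, size=(hi_res, hi_res),
                       mode="bilinear", align_corners=False)
    # The distilled encoder was trained only at 256^2, yet is queried at 512^2.
    latents = student_vae.encode(hi)
    recon_hi = student_vae.decode(latents)
    # Downsample reconstructions for comparison against the 256^2 originals
    # with PSNR, SSIM, LPIPS, rFID, etc.
    recon_256 = F.interpolate(recon_hi, size=tuple(images_256.shape[-2:]),
                              mode="bilinear", align_corners=False)
    return recon_256
```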

📝 Abstract
Variational Autoencoder (VAE) encoders play a critical role in modern generative models, yet their computational cost often motivates knowledge distillation or quantization to obtain compact alternatives. Existing studies typically assume that models perform better on samples close to their training distribution than on unseen distributions. In this work, we report a counter-intuitive phenomenon in VAE encoder distillation: a compact encoder distilled only at low resolution exhibits poor reconstruction performance at its native resolution, but achieves dramatically improved results when evaluated at higher, unseen input resolutions. Despite never being trained beyond $256^2$ resolution, the distilled encoder generalizes effectively to $512^2$ inputs, partially inheriting the teacher model's resolution preference. We further analyze latent distributions across resolutions and find that higher-resolution inputs produce latent representations more closely aligned with the teacher's manifold. Through extensive experiments on ImageNet-256, we show that simple resolution remapping (upsampling inputs before encoding and downsampling reconstructions for evaluation) leads to substantial gains across PSNR, MSE, SSIM, LPIPS, and rFID. These findings suggest that VAE encoder distillation learns resolution-consistent latent manifolds rather than resolution-specific pixel mappings. They also imply that high training costs in memory, time, and high-resolution data are not necessary for distilling a VAE with high-resolution reconstruction capability: even on low-resolution datasets, the distilled model can still acquire the teacher's detailed knowledge of high-resolution image reconstruction.
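The latent-distribution analysis mentioned in the abstract can be approximated with an equally small sketch: compare student and teacher latents for the same images at several input resolutions. The helper below, including the names `latent_gap` and `teacher_vae` and the use of MSE as the distance, is a hedged guess at such a protocol, not the paper's exact measurement.

```python
# Hedged sketch of a cross-resolution latent comparison: how far are the
# student's latents from the teacher's at each input resolution? The MSE
# distance and the encode() interface are assumptions, not the authors'
# exact analysis.
import torch
import torch.nn.functional as F

@torch.no_grad()
def latent_gap(student_vae, teacher_vae, images_256, resolutions=(256, 512)):
    """Return per-resolution MSE between student and teacher latents."""
    gaps = {}
    for r in resolutions:
        x = F.interpolate(images_256, size=(r, r),
                          mode="bilinear", align_corners=False)
        z_student = student_vae.encode(x)
        z_teacher = teacher_vae.encode(x)
        gaps[r] = F.mse_loss(z_student, z_teacher).item()
    # Per the paper's finding, one would expect gaps[512] < gaps[256]:
    # higher-resolution inputs land closer to the teacher's manifold.
    return gaps
```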
Problem

Research questions and friction points this paper is trying to address.

Variational Autoencoder
Knowledge Distillation
Resolution Extrapolation
Latent Manifold
Cross-Resolution Generalization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Variational Autoencoder
Knowledge Distillation
Resolution Extrapolation
Latent Manifold
Cross-Resolution Generalization
Jiaming Chu
School of Electronic Engineering, Beijing University of Posts and Telecommunications, Beijing, China
Tao Wang
School of Electronic Engineering, Beijing University of Posts and Telecommunications, Beijing, China
Lei Jin
Tsinghua University; School of Information Sciences, University of Pittsburgh