🤖 AI Summary
This work addresses a gap in how autoencoders (AEs) are evaluated for controllable diffusion models: reliance on generation-centric metrics such as gFID overlooks reconstruction fidelity, which can lead to condition drift and degraded controllability. The authors propose a multi-dimensional protocol for assessing condition drift and, through theoretical analysis and ControlNet-based experiments, show that gFID correlates only weakly with actual controllability, whereas instance-level reconstruction metrics more accurately reflect the alignment between conditions and generated outputs. Empirical studies across several recent ImageNet-trained AEs indicate that reconstruction-oriented metrics are substantially better aligned with condition preservation than gFID is. The findings reveal a misalignment between the prevailing gFID-centric AE selection paradigm and the requirements of controllable generation, and offer practical guidance for benchmarking and AE selection in controllable diffusion systems.
📝 Abstract
In latent diffusion models, the autoencoder (AE) is typically expected to balance two capabilities: faithful reconstruction and a generation-friendly latent space (e.g., low gFID). In recent ImageNet-scale AE studies, we observe a systematic bias toward generative metrics when handling this trade-off: reconstruction metrics are increasingly under-reported, and ablation-based AE selection often favors the best-gFID configuration even when reconstruction fidelity degrades. We theoretically analyze why this gFID-dominant preference can appear unproblematic for ImageNet generation yet becomes risky when scaling to controllable diffusion: AEs can induce condition drift, which limits the achievable condition alignment. We also find that reconstruction fidelity, especially as captured by instance-level measures, is a better indicator of controllability. We empirically validate the impact of this gFID-skewed AE evaluation on controllability by studying several recent ImageNet AEs. Using a multi-dimensional condition-drift evaluation protocol that reflects controllable generation tasks, we find that gFID is only weakly predictive of condition preservation, whereas reconstruction-oriented metrics are substantially more aligned with it. ControlNet experiments further confirm that controllability tracks condition preservation rather than gFID. Overall, our results expose a gap between ImageNet-centric AE evaluation and the requirements of scalable controllable diffusion, and offer practical guidance for more reliable benchmarking and model selection.
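To make the notion of condition drift concrete, the following is a minimal sketch and not the paper's protocol. It assumes drift can be measured by extracting a condition signal (here a simple gradient-magnitude edge map) from an image and from its AE reconstruction, then comparing the two. The names `edge_map`, `toy_autoencoder`, and `condition_drift` are hypothetical, and the pooling-based "autoencoder" is only a stand-in for a real lossy AE.

```python
import numpy as np


def edge_map(img):
    """Gradient-magnitude edge map; a stand-in for a real condition extractor."""
    gy, gx = np.gradient(img.astype(np.float64))
    return np.hypot(gx, gy)


def toy_autoencoder(img, factor=4):
    """Hypothetical lossy AE: average-pool by `factor`, then upsample by repetition.

    Assumes the image height and width are divisible by `factor`.
    """
    h, w = img.shape
    pooled = img.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))
    return np.repeat(np.repeat(pooled, factor, axis=0), factor, axis=1)


def condition_drift(img, recon, extractor=edge_map):
    """Mean absolute difference between the conditions of an image and its reconstruction."""
    return float(np.mean(np.abs(extractor(img) - extractor(recon))))


# Synthetic grayscale image (assumption: real evaluations would use dataset images).
rng = np.random.default_rng(0)
image = rng.random((64, 64))

drift_identity = condition_drift(image, image)                 # perfect reconstruction
drift_lossy = condition_drift(image, toy_autoencoder(image))   # lossy reconstruction

print(f"drift (identity): {drift_identity:.4f}")
print(f"drift (lossy AE): {drift_lossy:.4f}")
```

A perfect reconstruction yields zero drift under this definition, while any reconstruction that alters edges yields a positive value. A multi-dimensional protocol in the spirit of the abstract would repeat this comparison with several condition extractors, which the `extractor` argument is meant to allow.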