A Tilted Seesaw: Revisiting Autoencoder Trade-off for Controllable Diffusion

📅 2026-01-29
🤖 AI Summary
This work examines how autoencoders (AEs) are evaluated for controllable diffusion models. Selecting AEs by generation-centric metrics such as gFID overlooks reconstruction fidelity, which can cause condition drift and degrade controllability. The authors propose a multi-dimensional protocol for measuring condition drift and show, through theoretical analysis and experiments on several recent ImageNet-trained AEs, that gFID is only weakly predictive of condition preservation, whereas reconstruction-oriented metrics, especially instance-level ones, are substantially more aligned with it. ControlNet experiments further confirm that controllability tracks condition preservation rather than gFID. The findings expose a mismatch between the prevailing gFID-centric AE selection practice and the requirements of controllable generation, and offer guidance for benchmarking and model selection in controllable diffusion systems.
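
One way to read "weakly predictive" is as a rank correlation across a set of AEs: rank the AEs by a metric, rank them by a condition-preservation score, and compare the two orderings. The sketch below is only an illustration of that idea, not the paper's procedure. The function names `rank` and `spearman` are hypothetical, ties are assumed absent, and no numbers from the paper are used.

```python
import numpy as np

def rank(values):
    """Return 0-based ranks of the values (assumes no ties, a simplification)."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values)
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(len(values))
    return ranks

def spearman(x, y):
    """Spearman rank correlation, computed as the Pearson correlation of the ranks."""
    rx, ry = rank(x), rank(y)
    return float(np.corrcoef(rx, ry)[0, 1])
```

In use, `x` would hold one metric value per AE (for example gFID or a reconstruction metric) and `y` the matching condition-preservation scores. A magnitude close to 1 means the metric orders the AEs the same way (or exactly the reverse way) as condition preservation does; a value near 0 means the metric says little about it.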

📝 Abstract
In latent diffusion models, the autoencoder (AE) is typically expected to balance two capabilities: faithful reconstruction and a generation-friendly latent space (e.g., low gFID). In recent ImageNet-scale AE studies, we observe a systematic bias toward generative metrics in handling this trade-off: reconstruction metrics are increasingly under-reported, and ablation-based AE selection often favors the best-gFID configuration even when reconstruction fidelity degrades. We theoretically analyze why this gFID-dominant preference can appear unproblematic for ImageNet generation, yet becomes risky when scaling to controllable diffusion: AEs can induce condition drift, which limits achievable condition alignment. Meanwhile, we find that reconstruction fidelity, especially instance-level measures, better indicates controllability. We empirically validate the impact of tilted autoencoder evaluation on controllability by studying several recent ImageNet AEs. Using a multi-dimensional condition-drift evaluation protocol reflecting controllable generation tasks, we find that gFID is only weakly predictive of condition preservation, whereas reconstruction-oriented metrics are substantially more aligned. ControlNet experiments further confirm that controllability tracks condition preservation rather than gFID. Overall, our results expose a gap between ImageNet-centric AE evaluation and the requirements of scalable controllable diffusion, offering practical guidance for more reliable benchmarking and model selection.
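
Condition drift, as described above, can be pictured as the change in an extracted condition (edges, depth, pose, and so on) when an image passes through the AE. The following is a minimal sketch under stated assumptions, not the paper's protocol: `edge_map` is a toy gradient-magnitude extractor standing in for a real condition extractor, `identity_ae` and `blurry_ae` are hypothetical stand-ins for a perfect and a lossy autoencoder, and the input is random noise rather than a real image.

```python
import numpy as np

def edge_map(img):
    """Toy condition extractor: gradient magnitude of a 2D grayscale image."""
    gy, gx = np.gradient(img.astype(float))
    return np.hypot(gx, gy)

def condition_drift(img, autoencoder, extract=edge_map):
    """Mean absolute difference between the condition extracted from the
    original image and the one extracted from its reconstruction."""
    recon = autoencoder(img)
    return float(np.mean(np.abs(extract(img) - extract(recon))))

def identity_ae(img):
    """Hypothetical perfect AE: returns the input unchanged."""
    return img

def blurry_ae(img):
    """Hypothetical lossy AE: 3x3 box blur with edge padding."""
    padded = np.pad(img.astype(float), 1, mode="edge")
    h, w = img.shape
    out = np.zeros((h, w))
    for dy in range(3):
        for dx in range(3):
            out += padded[dy:dy + h, dx:dx + w]
    return out / 9.0

# Synthetic input for illustration only (assumption: not real data).
rng = np.random.default_rng(0)
image = rng.random((32, 32))

drift_perfect = condition_drift(image, identity_ae)
drift_lossy = condition_drift(image, blurry_ae)
```

A perfect reconstruction yields zero drift, while the blurring stand-in yields a positive value because it changes the edge map. In a real evaluation, `autoencoder` would be an encode-then-decode pass through the AE under test and `extract` the condition extractor used by the downstream controllable model.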
Problem

Research questions and friction points this paper is trying to address.

autoencoder
controllable diffusion
condition drift
reconstruction fidelity
gFID
Innovation

Methods, ideas, or system contributions that make the work stand out.

autoencoder evaluation
controllable diffusion
condition drift
reconstruction fidelity
gFID