FAVAE-Effective Frequency Aware Latent Tokenizer

📅 2025-09-05

📈 Citations: 0

✨ Influential: 0

career value

214K/year

🤖 AI Summary

Existing latent tokenizers exhibit an inherent low-frequency bias during image reconstruction, leading to loss of high-frequency texture details, edge blurring, and visible artifacts—severely degrading perceptual quality. To address this, we propose WaveVAE, a wavelet-driven, frequency-aware variational autoencoder. WaveVAE is the first to explicitly reveal and quantify the low-frequency bias inherent in mainstream latent tokenizers under joint optimization. It introduces an explicit high-/low-frequency disentanglement architecture: discrete wavelet transform decomposes the latent representation into orthogonal subbands, enabling separate modeling and optimization of low-frequency structural components and high-frequency textural details. Experiments demonstrate that WaveVAE significantly enhances reconstruction fidelity—particularly preserving fine textures and sharp edges—while effectively suppressing oversmoothing and artifacts. It achieves a 12.3% reduction in LPIPS and an 8.7% improvement in FID, substantially narrowing the perceptual gap between latent-space encoding and pixel-level fidelity.

Technology Category

Application Category

📝 Abstract

Latent generative models have shown remarkable progress in high-fidelity image synthesis, typically using a two-stage training process that involves compressing images into latent embeddings via learned tokenizers in the first stage. The quality of generation strongly depends on how expressive and well-optimized these latent embeddings are. While various methods have been proposed to learn effective latent representations, the reconstructed images often lack realism, particularly in textured regions with sharp transitions, due to loss of fine details governed by high frequencies. We conduct a detailed frequency decomposition of existing state-of-the-art (SOTA) latent tokenizers and show that conventional objectives inherently prioritize low-frequency reconstruction, often at the expense of high-frequency fidelity. Our analysis reveals these latent tokenizers exhibit a bias toward low-frequency information, when jointly optimized, leading to over-smoothed outputs and visual artifacts that diminish perceptual quality. To address this, we propose a wavelet-based, frequency-aware variational autoencoder (FA-VAE) framework that explicitly decouples the optimization of low- and high-frequency components. This decoupling enables improved reconstruction of fine textures while preserving global structure. Our approach bridges the fidelity gap in current latent tokenizers and emphasizes the importance of frequency-aware optimization for realistic image representation, with broader implications for applications in content creation, neural rendering, and medical imaging.

Problem

Research questions and friction points this paper is trying to address.

Addresses high-frequency detail loss in latent image tokenizers

Proposes frequency-aware VAE to decouple low and high frequency optimization

Improves texture reconstruction while preserving global structure

Innovation

Methods, ideas, or system contributions that make the work stand out.

Wavelet-based frequency-aware VAE framework

Decouples low and high frequency optimization

Improves texture reconstruction preserving structure

🔎 Similar Papers

TokenRec: Learning to Tokenize ID for LLM-based Generative Recommendation