AI Summary
How to unify semantic abstraction with pixel-level fidelity in generative modeling? Existing approaches often treat semantic and pixel representations separately, leading to suboptimal trade-offs between high-level understanding and low-level reconstruction.
Method: This paper identifies a spectral division of labor: semantic encoders primarily capture low-frequency abstractions, while pixel encoders also preserve high-frequency details. Based on this insight, it proposes the "Prism Hypothesis" and introduces Unified Autoencoding (UAE), a single, compact latent-space framework featuring a learnable band modulator that jointly models semantic structure and pixel details. UAE further incorporates multi-scale feature disentanglement and hierarchical reconstruction.
Contribution/Results: Evaluated on ImageNet and MS-COCO, UAE improves both semantic understanding (e.g., classification, segmentation) and pixel-accurate reconstruction (e.g., PSNR, LPIPS) across diverse downstream tasks, with reported state-of-the-art performance for unified representation learning.
Abstract
Deep representations across modalities are inherently intertwined. In this paper, we systematically analyze the spectral characteristics of various semantic and pixel encoders. Interestingly, our study uncovers a highly inspiring and rarely explored correspondence between an encoder's feature spectrum and its functional role: semantic encoders primarily capture low-frequency components that encode abstract meaning, whereas pixel encoders additionally retain high-frequency information that conveys fine-grained detail. This heuristic finding offers a unifying perspective that ties encoder behavior to its underlying spectral structure. We define it as the Prism Hypothesis, where each data modality can be viewed as a projection of the natural world onto a shared feature spectrum, just like the prism. Building on this insight, we propose Unified Autoencoding (UAE), a model that harmonizes semantic structure and pixel details via an innovative frequency-band modulator, enabling their seamless coexistence. Extensive experiments on ImageNet and MS-COCO benchmarks validate that our UAE effectively unifies semantic abstraction and pixel-level fidelity into a single latent space with state-of-the-art performance.
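To make the low-frequency versus high-frequency distinction concrete, the following is a minimal sketch of how a single 2D feature map could be split into two frequency bands and how its low-frequency energy share could be measured. It is not the paper's band modulator: the hard radial mask, the `cutoff` value, and the function names `radial_frequency_grid`, `split_bands`, and `low_frequency_energy_fraction` are all assumptions chosen for illustration.

```python
import numpy as np


def radial_frequency_grid(height, width):
    """Radial frequency of every FFT bin, in cycles per sample (0 at DC)."""
    fy = np.fft.fftfreq(height)[:, None]
    fx = np.fft.fftfreq(width)[None, :]
    return np.sqrt(fy ** 2 + fx ** 2)


def split_bands(feature_map, cutoff=0.1):
    """Split a 2D map into low- and high-frequency parts that sum to the input.

    `cutoff` is a hypothetical threshold, not a value from the paper.
    """
    spectrum = np.fft.fft2(feature_map)
    low_mask = radial_frequency_grid(*feature_map.shape) <= cutoff
    # The two masks partition the spectrum, so low + high reconstructs the input.
    low = np.fft.ifft2(spectrum * low_mask).real
    high = np.fft.ifft2(spectrum * ~low_mask).real
    return low, high


def low_frequency_energy_fraction(feature_map, cutoff=0.1):
    """Share of spectral power at or below `cutoff`."""
    power = np.abs(np.fft.fft2(feature_map)) ** 2
    low_mask = radial_frequency_grid(*feature_map.shape) <= cutoff
    return power[low_mask].sum() / power.sum()


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = np.linspace(0.0, 2.0 * np.pi, 64, endpoint=False)
    smooth = np.sin(x)[None, :] + np.cos(x)[:, None]  # slowly varying map
    noisy = rng.standard_normal((64, 64))             # broadband map

    print("smooth:", low_frequency_energy_fraction(smooth))
    print("noisy: ", low_frequency_energy_fraction(noisy))
```

Under the hypothesis described above, one would expect features from a semantic encoder to behave more like the smooth map (most energy below the cutoff) and features from a pixel encoder to keep a larger share above it. A learnable modulator would presumably replace the fixed mask with trained per-band weights, but that detail is not given here.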