The Prism Hypothesis: Harmonizing Semantic and Pixel Representations via Unified Autoencoding

📅 2025-12-22
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Problem: How can semantic abstraction be unified with pixel-level fidelity in generative modeling? Existing approaches often treat semantic and pixel representations separately, which forces a trade-off between high-level understanding and low-level reconstruction. Method: This paper identifies a spectral division of labor: semantic encoders primarily capture low-frequency abstractions, while pixel encoders also preserve high-frequency details. Based on this insight, the authors propose the “Prism Hypothesis” and introduce Unified Autoencoding (UAE), a framework with a single, compact latent space and a learnable band modulator that jointly models semantic structure and pixel details. UAE further incorporates multi-scale feature disentanglement and hierarchical reconstruction. Contribution/Results: Evaluated on ImageNet and MS-COCO, UAE improves both semantic understanding (e.g., classification, segmentation) and pixel-accurate reconstruction (e.g., PSNR, LPIPS), outperforming state-of-the-art methods across diverse downstream tasks and setting a new state of the art for unified representation learning.
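PSNR, one of the reconstruction metrics named in the summary, has a standard definition: 10 · log10(MAX² / MSE). The following is a minimal NumPy sketch of that formula, not the paper's evaluation code; the function name `psnr` and the `max_value` parameter are illustrative choices, and images are assumed to be scaled to [0, 1].

```python
import numpy as np


def psnr(reference, reconstruction, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two same-shaped arrays.

    max_value is the largest possible pixel value (assumed 1.0 here;
    use 255 for 8-bit images).
    """
    reference = np.asarray(reference, dtype=np.float64)
    reconstruction = np.asarray(reconstruction, dtype=np.float64)
    mse = np.mean((reference - reconstruction) ** 2)
    if mse == 0.0:
        # Identical inputs: the ratio is unbounded.
        return float("inf")
    return 10.0 * np.log10(max_value ** 2 / mse)
```

Higher values mean the reconstruction is closer to the reference in a per-pixel sense. LPIPS, by contrast, relies on a pretrained network and is not sketched here.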

📝 Abstract
Deep representations across modalities are inherently intertwined. In this paper, we systematically analyze the spectral characteristics of various semantic and pixel encoders. Our study uncovers a rarely explored correspondence between an encoder's feature spectrum and its functional role: semantic encoders primarily capture low-frequency components that encode abstract meaning, whereas pixel encoders additionally retain high-frequency information that conveys fine-grained detail. This heuristic finding offers a unifying perspective that ties encoder behavior to its underlying spectral structure. We define it as the Prism Hypothesis, where each data modality can be viewed as a projection of the natural world onto a shared feature spectrum, much as a prism spreads light into a spectrum. Building on this insight, we propose Unified Autoencoding (UAE), a model that harmonizes semantic structure and pixel details via a frequency-band modulator, enabling them to coexist. Extensive experiments on ImageNet and MS-COCO benchmarks validate that UAE effectively unifies semantic abstraction and pixel-level fidelity in a single latent space with state-of-the-art performance.
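The spectral claim in the abstract can be probed with a simple measurement: how much of a feature map's energy lies below a chosen spatial frequency. The sketch below is one way to do that with a 2D FFT. It is an illustration under assumptions, not the authors' analysis protocol: the function name, the `cutoff` value, and the choice to subtract the mean (so the constant offset does not dominate) are all hypothetical.

```python
import numpy as np


def low_frequency_energy_fraction(feature_map, cutoff=0.25):
    """Share of spectral energy at radial frequency <= cutoff.

    feature_map: 2D array (H, W), e.g. one channel of an encoder's
        spatial feature grid.
    cutoff: radius in cycles per sample; 0.5 is the Nyquist limit
        along a single axis. The default is an arbitrary example.
    """
    feature_map = np.asarray(feature_map, dtype=np.float64)
    # Remove the constant offset so the DC term does not dominate.
    feature_map = feature_map - feature_map.mean()
    height, width = feature_map.shape

    power = np.abs(np.fft.fft2(feature_map)) ** 2

    # Radial distance of every FFT bin from the zero frequency.
    fy = np.fft.fftfreq(height)[:, None]
    fx = np.fft.fftfreq(width)[None, :]
    radius = np.sqrt(fy ** 2 + fx ** 2)

    total = power.sum()
    if total == 0.0:
        # Constant input: no non-DC energy to distribute.
        return 0.0
    return float(power[radius <= cutoff].sum() / total)


# Toy inputs: a slowly varying pattern and a pixel-level checkerboard.
x = np.arange(64)
smooth = np.tile(np.sin(2 * np.pi * 2 * x / 64), (64, 1))
checker = (-1.0) ** np.add.outer(x, x)
print(low_frequency_energy_fraction(smooth))
print(low_frequency_energy_fraction(checker))
```

Under the hypothesis, features from a semantic encoder would score closer to the smooth pattern, while features from a pixel encoder would keep a larger share of energy above the cutoff.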

Problem

Research questions and friction points this paper is trying to address.

How to harmonize semantic and pixel representations within a single autoencoder
How to combine semantic abstraction and pixel-level fidelity in one latent space
How an encoder's behavior relates to the spectrum of its features

Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified Autoencoding harmonizes semantic and pixel representations
Frequency-band modulator enables seamless coexistence of abstraction and detail
Single latent space unifies semantic abstraction with pixel-level fidelity
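
The source does not describe the internals of the frequency-band modulator, so the following is only a rough sketch of the general idea of splitting a latent grid into frequency bands and reweighting each one. Everything here is an assumption made for illustration: the function names, the radial band edges, and the use of fixed scalar gains (a learnable version would treat the gains as trainable parameters).

```python
import numpy as np


def radial_band_masks(height, width, edges):
    """Boolean masks that partition the 2D frequency plane into radial bands.

    edges: increasing radii in cycles per sample, e.g. [0.1, 0.25].
    Returns len(edges) + 1 masks; the last covers all higher frequencies.
    """
    fy = np.fft.fftfreq(height)[:, None]
    fx = np.fft.fftfreq(width)[None, :]
    radius = np.sqrt(fy ** 2 + fx ** 2)
    bounds = [-np.inf, *edges, np.inf]
    return [(radius > lo) & (radius <= hi)
            for lo, hi in zip(bounds[:-1], bounds[1:])]


def modulate_bands(latent, edges, gains):
    """Scale each frequency band of a 2D latent by its own gain.

    gains: one scalar per band (hypothetical stand-ins for learned weights).
    """
    latent = np.asarray(latent, dtype=np.float64)
    masks = radial_band_masks(*latent.shape, edges)
    if len(gains) != len(masks):
        raise ValueError("need exactly one gain per band")
    spectrum = np.fft.fft2(latent)
    scaled = sum(g * (spectrum * m) for g, m in zip(gains, masks))
    # The masks are symmetric around zero frequency, so the result is
    # real up to rounding error.
    return np.fft.ifft2(scaled).real


# Toy latent: coarse structure plus pixel-level detail.
x = np.arange(64)
coarse = np.tile(np.sin(2 * np.pi * 2 * x / 64), (64, 1))
detail = (-1.0) ** np.add.outer(x, x)
latent = coarse + detail

coarse_only = modulate_bands(latent, [0.1, 0.25], [1.0, 0.0, 0.0])
unchanged = modulate_bands(latent, [0.1, 0.25], [1.0, 1.0, 1.0])
```

Because the masks partition the spectrum, setting every gain to 1 returns the input, and zeroing the upper bands keeps only the coarse structure. This is what allows a single latent to carry both kinds of information in separate bands.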