AI Summary
This work addresses the optimization conflict between high-fidelity reconstruction and semantic abstraction in visual tokenization, which the authors attribute to manifold misalignment. They propose MUSE, a framework that introduces topological orthogonality, using structural information as an orthogonal bridge to decouple the two objectives within a Transformer architecture. Specifically, structural gradients refine the attention topology to improve reconstruction fidelity, while semantic gradients update the feature values to strengthen semantic perception, turning the zero-sum competition between the objectives into mutual reinforcement. The approach avoids the trade-off seen under naive joint optimization, achieving state-of-the-art generation quality (gFID 3.08) and surpassing its teacher model InternViT-300M in linear probing accuracy (85.2% vs. 82.5%).
Abstract
Unified visual tokenization faces a fundamental trade-off between high-fidelity pixel reconstruction (spatial equivariance) and semantic abstraction (conceptual invariance). We attribute this conflict to Manifold Misalignment: naive joint optimization induces opposing gradients, creating a zero-sum game between reconstruction and perception. To address this, we propose MUSE, a framework based on Topological Orthogonality. By treating Structure as an orthogonal bridge, MUSE decouples optimization within Transformers: structural gradients refine attention topology, while semantic gradients update feature values. This turns destructive interference into Mutual Reinforcement. Experiments show that MUSE breaks the trade-off, achieving state-of-the-art generation quality (gFID 3.08) and surpassing its teacher InternViT-300M in linear probing (85.2% vs. 82.5%), demonstrating that structurally aligned reconstruction can enhance semantic perception. Code is available at https://github.com/PanqiYang1/MUSE.
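The notion of "opposing gradients" can be made concrete with a common diagnostic: the cosine similarity between the gradients of two losses with respect to the same shared parameters. A negative value means a step that lowers one loss tends to raise the other. The sketch below is only an illustration under stated assumptions. The function name `cosine_similarity` and the toy gradient vectors are hypothetical, and this is not taken from the MUSE code or presented as the authors' own measurement.

```python
import math


def cosine_similarity(g_a, g_b):
    """Cosine of the angle between two gradient vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(g_a, g_b))
    norm_a = math.sqrt(sum(a * a for a in g_a))
    norm_b = math.sqrt(sum(b * b for b in g_b))
    if norm_a == 0.0 or norm_b == 0.0:
        # The angle is undefined for a zero gradient; report no measurable conflict.
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical toy gradients on the same shared parameters (not values from the paper).
grad_reconstruction = [0.8, -0.2, 0.5]
grad_semantic = [-0.6, 0.1, -0.4]

conflict = cosine_similarity(grad_reconstruction, grad_semantic)
print(f"cosine similarity: {conflict:.3f}")
if conflict < 0:
    print("The two objectives pull the shared parameters in opposing directions.")
```

In a real model, the two vectors would be the flattened gradients of the reconstruction loss and the semantic loss over the parameters both losses touch. Routing each loss to a different parameter group, as the abstract describes for attention topology and feature values, is one way to keep such gradients from cancelling each other.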