MUSE: Resolving Manifold Misalignment in Visual Tokenization via Topological Orthogonality

📅 2026-05-06
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
This work addresses the optimization conflict between high-fidelity reconstruction and semantic abstraction in visual tokenization, which arises from manifold misalignment. The authors propose MUSE, a novel framework that introduces topological orthogonality for the first time, leveraging structural information as an orthogonal bridge to decouple these dual objectives within a Transformer architecture. Specifically, structural gradients optimize attention topology to enhance reconstruction fidelity, while semantic gradients update feature representations to strengthen perceptual quality, thereby transforming the inherent zero-sum competition into a mutually reinforcing mechanism. This approach overcomes the limitations of conventional joint optimization strategies, achieving state-of-the-art generation quality with a gFID of 3.08 and surpassing the teacher model InternViT-300M in linear probing accuracy (85.2% vs. 82.5%).
📝 Abstract
Unified visual tokenization faces a fundamental trade-off between high-fidelity pixel reconstruction (spatial equivariance) and semantic abstraction (conceptual invariance). We attribute this conflict to Manifold Misalignment: naive joint optimization induces opposing gradients, creating a zero-sum game between reconstruction and perception. To address this, we propose MUSE, a framework based on Topological Orthogonality. By treating Structure as an orthogonal bridge, MUSE decouples optimization within Transformers: structural gradients refine attention topology, while semantic gradients update feature values. This turns destructive interference into Mutual Reinforcement. Experiments show that MUSE breaks the trade-off, achieving state-of-the-art generation quality (gFID 3.08) and surpassing its teacher InternViT-300M in linear probing (85.2% vs. 82.5%), demonstrating that structurally aligned reconstruction can enhance semantic perception. Code is available at https://github.com/PanqiYang1/MUSE.
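The "opposing gradients" claim in the abstract can be made concrete with a toy calculation. The sketch below is not MUSE's method or code. It is a minimal illustration that assumes two quadratic losses with hypothetical optima (`recon_target`, `sem_target`) and a hypothetical split of parameters into a "topology" block and a "value" block. When both losses act on the same parameters, their gradients point in opposite directions. When each loss acts only on its own block, the gradients are orthogonal and cannot cancel each other.

```python
import math


def grad_sq_dist(w, target):
    """Gradient of 0.5 * ||w - target||^2 with respect to w."""
    return [wi - ti for wi, ti in zip(w, target)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    return dot(u, v) / math.sqrt(dot(u, u) * dot(v, v))


# Hypothetical optima for the two objectives (toy values, not from the paper).
recon_target = [1.0, 0.0]
sem_target = [-1.0, 0.0]

# Case 1: both losses act on the same shared parameter vector.
w_shared = [0.0, 0.0]
g_recon = grad_sq_dist(w_shared, recon_target)
g_sem = grad_sq_dist(w_shared, sem_target)
shared_cos = cosine(g_recon, g_sem)  # negative means the gradients conflict

# Case 2 (assumed split): each loss acts on its own parameter block, so in the
# full parameter space [topology block, value block] each gradient is zero on
# the other loss's block.
zeros = [0.0, 0.0]
g_recon_full = g_recon + zeros  # nonzero only on the topology block
g_sem_full = zeros + g_sem      # nonzero only on the value block
decoupled_dot = dot(g_recon_full, g_sem_full)  # zero means no interference

print(shared_cos, decoupled_dot)
```

In this toy setting, a negative cosine in the shared case means any step that lowers one loss raises the other, while a zero inner product in the decoupled case means each update leaves the other objective's gradient direction untouched. How MUSE realizes this split inside attention layers is described in the paper, not here.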
Problem

Research questions and friction points this paper is trying to address.

Manifold Misalignment
Visual Tokenization
Spatial Equivariance
Conceptual Invariance
Topological Orthogonality
Innovation

Methods, ideas, or system contributions that make the work stand out.

Topological Orthogonality
Manifold Misalignment
Visual Tokenization
Mutual Reinforcement
Structure-Semantic Decoupling
Panqi Yang
State Key Laboratory of Human-Machine Hybrid Augmented Intelligence, National Engineering Research Center of Visual Information and Applications, and Institute of Artificial Intelligence and Robotics, Xi'an Jiao Tong University
Haodong Jing
State Key Laboratory of Human-Machine Hybrid Augmented Intelligence, National Engineering Research Center of Visual Information and Applications, and Institute of Artificial Intelligence and Robotics, Xi'an Jiao Tong University
Jiahao Chao
Xiaohongshu Inc.
Tingyan Xiang
Xiaohongshu Inc.
Li Lin
Xiaohongshu Inc.
Yao Hu
Zhejiang University
Machine Learning
Yang Luo
Xiaohongshu Inc.
Yongqiang Ma
Wuhan University
Scientific Information Mining, Large Language Models, AI for Science