Resolving Representation Ambiguity in Feedforward Novel View Synthesis Transformer via Semantic-Spatial Decoupling

📅 2026-05-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a limitation of existing feedforward novel view synthesis (NVS) Transformers: semantic and spatial information are entangled in a shared feature space, so spatial bias can interfere with appearance representation and degrade rendering fidelity. To resolve this, the authors propose a semantic-spatial decoupled architecture that keeps the two kinds of information in separate token branches while preserving cross-branch interaction through shared attention routing. The base decoupled design adds virtually no inference latency. On top of it, the authors introduce two optional components: categorized supervision, which provides branch-specific training signals, and bidirectional modulation, which improves interaction between the branches. The approach yields consistent improvements across both decoder-only and encoder-decoder feedforward NVS models.
📝 Abstract
Transformer-based models have advanced feedforward novel view synthesis (NVS). Current architectures such as GS-LRM and LVSM mix semantic information (e.g., RGB) and spatial information (e.g., Plücker rays) into a shared feature space. Since Plücker rays naturally carry lattice-like spatial structure, these designs can make the spatial bias interfere with appearance representation and degrade rendering fidelity. To this end, we propose to decouple the representation of feedforward NVS transformers into separate semantic and spatial tokens. The decoupled design keeps semantic and spatial information explicit in their branches while preserving cross-branch interaction through shared attention routing. Built on this design, we introduce optional categorized supervision and bidirectional modulation: the former provides branch-specific training signals, while the latter improves interaction between the two branches. Notably, the base decoupled design introduces virtually zero additional inference latency due to its architectural design. The proposed designs achieve consistent improvements, demonstrating effectiveness across decoder-only and encoder-decoder feedforward NVS models.
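As a rough illustration of the decoupling described in the abstract, the sketch below shows one possible reading of "shared attention routing": a single attention map is computed once and then applied to separate semantic and spatial value streams. This is not the authors' implementation. The function name `decoupled_attention`, the parameter names, the choice to form queries and keys from the concatenated branches, and the residual connections are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def decoupled_attention(sem, spa, params):
    """Hypothetical single-head layer; sem and spa are (N, D) token arrays."""
    # Assumption: queries and keys see both branches, so one attention
    # map (the "routing") is shared by the two branches.
    joint = np.concatenate([sem, spa], axis=-1)       # (N, 2D)
    q = joint @ params["wq"]                          # (N, D)
    k = joint @ params["wk"]                          # (N, D)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))    # (N, N)
    # Each branch has its own value projection, so semantic and spatial
    # features are never summed into one shared feature vector.
    sem_out = sem + attn @ (sem @ params["wv_sem"])   # (N, D)
    spa_out = spa + attn @ (spa @ params["wv_spa"])   # (N, D)
    return sem_out, spa_out, attn

rng = np.random.default_rng(0)
N, D = 6, 8  # toy sizes, chosen arbitrarily
params = {
    "wq": rng.normal(size=(2 * D, D)) * 0.1,
    "wk": rng.normal(size=(2 * D, D)) * 0.1,
    "wv_sem": rng.normal(size=(D, D)) * 0.1,
    "wv_spa": rng.normal(size=(D, D)) * 0.1,
}
sem = rng.normal(size=(N, D))  # stand-in for RGB-derived tokens
spa = rng.normal(size=(N, D))  # stand-in for Plücker-ray-derived tokens
sem_out, spa_out, attn = decoupled_attention(sem, spa, params)
```

In this reading the attention map is computed once per layer and reused for both branches, which is one plausible way a decoupled design could avoid adding a second attention computation. Whether that matches the source of the latency claim in the paper is not confirmed here.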
Problem

Research questions and friction points this paper is trying to address.

novel view synthesis
representation ambiguity
semantic-spatial decoupling
Transformer
rendering fidelity
Innovation

Methods, ideas, or system contributions that make the work stand out.

semantic-spatial decoupling
feedforward novel view synthesis
Transformer architecture
Plücker rays
cross-branch interaction
Yihang Wu
Institute of Trustworthy Embodied Artificial Intelligence (TEAI), Fudan University; Shanghai Key Laboratory of Multimodal Embodied AI
Yihang Sun
Sch. of Artificial Intelligence & Sch. of Computer Science, Shanghai Jiao Tong University
Shaofeng Zhang
Southern University of Science and Technology
Learn to Optimize
Zuxuan Wu
Fudan University
Junchi Yan
FIAPR & ICML Board Member, SJTU (2018-), SII (2024-), AWS (2019-2022), IBM (2011-2018)
Computational Intelligence, AI4Science, Machine Learning, Autonomous Driving
Xiaosong Jia
Assistant Professor, Institute of Trustworthy Embodied AI (TEAI), Fudan University
Embodied AI, Autonomous Driving, World Model, Reinforcement Learning
Yu-gang Jiang
Institute of Trustworthy Embodied Artificial Intelligence (TEAI), Fudan University; Shanghai Key Laboratory of Multimodal Embodied AI