Resolving Representation Ambiguity in Feedforward Novel View Synthesis Transformer via Semantic-Spatial Decoupling

📅 2026-05-18

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a key limitation of existing feedforward novel view synthesis (NVS) Transformers: semantic and spatial information are entangled in a shared feature space, so spatial bias interferes with appearance representation and degrades rendering fidelity. To resolve this, the authors propose a semantic-spatial decoupled architecture that keeps the two kinds of information in separate token branches while preserving cross-branch interaction through shared attention routing. The base decoupled design adds virtually no inference latency. On top of it, they introduce two optional components: categorized supervision, which provides branch-specific training signals, and bidirectional modulation, which improves interaction between the two branches. The proposed designs yield consistent improvements across both decoder-only and encoder-decoder feedforward NVS models.
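To make the idea of separate branches with shared attention routing concrete, here is a minimal NumPy sketch of one possible reading: a single attention map is computed once and reused to mix the values of both branches. This is an illustration under stated assumptions, not the authors' implementation. The function name `shared_routing_attention`, the choice to build queries and keys from the concatenated branches, the single head, and the residual update are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def shared_routing_attention(sem, spa, w_q, w_k, w_v_sem, w_v_spa):
    """Single-head attention with one routing map shared by two branches.

    sem, spa : (n_tokens, dim) semantic and spatial tokens.
    w_q, w_k : (2 * dim, d_k) projections of the concatenated tokens
               (assumption: routing sees both branches).
    w_v_sem, w_v_spa : (dim, dim) branch-specific value projections.
    """
    joint = np.concatenate([sem, spa], axis=-1)           # (n, 2 * dim)
    q = joint @ w_q                                       # (n, d_k)
    k = joint @ w_k                                       # (n, d_k)
    routing = softmax(q @ k.T / np.sqrt(q.shape[-1]))     # (n, n), computed once
    # Each branch keeps its own values; only the routing map is shared.
    sem_out = sem + routing @ (sem @ w_v_sem)
    spa_out = spa + routing @ (spa @ w_v_spa)
    return sem_out, spa_out, routing

# Toy usage with random weights (shapes are illustrative only).
rng = np.random.default_rng(0)
n, dim, d_k = 5, 8, 4
sem = rng.normal(size=(n, dim))
spa = rng.normal(size=(n, dim))
w_q = rng.normal(size=(2 * dim, d_k))
w_k = rng.normal(size=(2 * dim, d_k))
w_v_sem = rng.normal(size=(dim, dim))
w_v_spa = rng.normal(size=(dim, dim))
sem_out, spa_out, routing = shared_routing_attention(
    sem, spa, w_q, w_k, w_v_sem, w_v_spa
)
```

In this sketch the attention map is computed only once per layer, which is one way a decoupled design could avoid extra attention cost; whether the paper obtains its latency behavior this way is not stated in the summary.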

📝 Abstract

Transformer-based models have advanced feedforward novel view synthesis (NVS). Current architectures such as GS-LRM and LVSM mix semantic information (e.g., RGB) and spatial information (e.g., Plücker rays) into a shared feature space. Since Plücker rays naturally carry lattice-like spatial structure, these designs can make the spatial bias interfere with appearance representation and degrade rendering fidelity. To this end, we propose to decouple the representation of feedforward NVS transformers into separate semantic and spatial tokens. The decoupled design keeps semantic and spatial information explicit in their branches while preserving cross-branch interaction through shared attention routing. Built on this design, we introduce optional categorized supervision and bidirectional modulation: the former provides branch-specific training signals, while the latter improves interaction between the two branches. Notably, the base decoupled design introduces virtually zero additional inference latency due to its architectural design. The proposed designs achieve consistent improvements, demonstrating effectiveness across decoder-only and encoder-decoder feedforward NVS models.
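For readers unfamiliar with the spatial input mentioned above, the following sketch computes a 6D Plücker ray embedding from ray origins and directions, using the standard definition of a unit direction plus the moment (origin cross direction). The function name `plucker_rays` and the component ordering (direction first, moment second) are assumptions made for illustration; implementations differ in ordering, and the abstract does not specify one.

```python
import numpy as np

def plucker_rays(origins, directions):
    """Return Pluecker coordinates (d, o x d) for a batch of rays.

    origins, directions : arrays of shape (..., 3).
    Output              : array of shape (..., 6).
    """
    # Normalize directions so the embedding does not depend on their length.
    d = directions / np.linalg.norm(directions, axis=-1, keepdims=True)
    # The moment is unchanged when the origin slides along the ray.
    m = np.cross(origins, d)
    return np.concatenate([d, m], axis=-1)

# Two origins on the same ray produce the same embedding.
o1 = np.array([0.0, 0.0, 1.0])
o2 = np.array([3.0, 0.0, 1.0])   # o1 shifted along the ray direction
direction = np.array([2.0, 0.0, 0.0])
ray_a = plucker_rays(o1, direction)
ray_b = plucker_rays(o2, direction)
```

Because the embedding depends only on the ray itself and not on where along the ray the origin is sampled, neighboring pixels of a camera produce smoothly varying, grid-structured values, which is consistent with the lattice-like spatial structure the abstract refers to.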

Problem

Research questions and friction points this paper is trying to address.

novel view synthesis

representation ambiguity

semantic-spatial decoupling

Transformer

rendering fidelity

Innovation

Methods, ideas, or system contributions that make the work stand out.

semantic-spatial decoupling

feedforward novel view synthesis

Transformer architecture