🤖 AI Summary
This work addresses a key limitation of existing feedforward novel view synthesis (NVS) Transformers: semantic and spatial information are entangled in a shared feature space, so spatial bias interferes with appearance representation and degrades rendering fidelity. The authors propose a semantic-spatial decoupled architecture that keeps the two kinds of information in separate token branches while preserving cross-branch interaction through shared attention routing. On top of this design they introduce optional categorized supervision, which provides branch-specific training signals, and bidirectional modulation, which improves interaction between the branches. The base decoupled design adds virtually no inference latency. The approach yields consistent improvements across both decoder-only and encoder-decoder feedforward NVS models.
📝 Abstract
Transformer-based models have advanced feedforward novel view synthesis (NVS). Current architectures such as GS-LRM and LVSM mix semantic information (e.g., RGB) and spatial information (e.g., Plücker rays) in a shared feature space. Because Plücker rays naturally carry lattice-like spatial structure, this mixing can let spatial bias interfere with appearance representation and degrade rendering fidelity. To address this, we propose decoupling the representation of feedforward NVS transformers into separate semantic and spatial tokens. The decoupled design keeps semantic and spatial information explicit in their own branches while preserving cross-branch interaction through shared attention routing. Building on this design, we introduce optional categorized supervision and bidirectional modulation: the former provides branch-specific training signals, while the latter improves interaction between the two branches. Notably, the base decoupled design adds virtually no inference latency. The proposed designs achieve consistent improvements across decoder-only and encoder-decoder feedforward NVS models.
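For readers unfamiliar with the spatial input mentioned above, the following is a minimal sketch of how a per-pixel Plücker ray grid can be computed. It is not taken from the paper: the function names (`plucker_ray`, `pixel_rays`), the (direction, moment) ordering, and the simplified pinhole camera with identity rotation are all assumptions made for illustration. The regular pixel grid it iterates over is one way to picture the lattice-like structure the abstract refers to.

```python
import math


def normalize(v):
    """Return v scaled to unit length."""
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)


def cross(a, b):
    """3D cross product a x b."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def plucker_ray(origin, direction):
    """6D Plücker coordinates (d, o x d) of a ray.

    The (direction, moment) ordering is an assumption; some codebases
    store (moment, direction) instead.
    """
    d = normalize(direction)
    m = cross(origin, d)  # moment: unchanged if origin slides along the ray
    return d + m


def pixel_rays(height, width, focal, cam_origin):
    """Plücker rays for every pixel of a hypothetical pinhole camera.

    Assumes identity rotation (camera axes aligned with world axes),
    the principal point at the image center, and focal length in pixels.
    """
    rays = []
    for v in range(height):
        row = []
        for u in range(width):
            # Direction through the pixel center in camera coordinates.
            x = (u + 0.5 - width / 2) / focal
            y = (v + 0.5 - height / 2) / focal
            row.append(plucker_ray(cam_origin, (x, y, 1.0)))
        rays.append(row)
    return rays
```

Each pixel thus gets a 6-dimensional spatial descriptor that depends only on camera geometry, not on image content, which is why it is natural to treat it as a separate spatial signal alongside RGB.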