S-VGGT: Structure-Aware Subscene Decomposition for Scalable 3D Foundation Models

📅 2026-03-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the quadratic computational complexity of global attention in feed-forward 3D foundation models, which hinders scalability to long input sequences. The authors propose a structure-aware subscene decomposition approach that targets the structural redundancy of densely sampled frames by constructing a dense scene graph to guide scene partitioning. Frames are softly assigned to a small number of subscenes, and a shared reference frame lets the subscenes be processed in parallel without explicit geometric alignment. Because it reduces global attention overhead at the frame level rather than the token level, the method is orthogonal to existing token-level acceleration strategies and improves inference efficiency while preserving reconstruction accuracy.
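To make the scaling argument concrete, the following is a back-of-the-envelope sketch and is not taken from the paper. It assumes that every frame contributes the same number of tokens, that global attention cost grows with the square of the total token count, and that the non-reference frames are split as evenly as possible into subscenes that each also contain one shared reference frame. The function names and the token count are hypothetical, and the extra cost of softly assigning a frame to more than one subscene is ignored.

```python
def global_attention_pairs(num_frames: int, tokens_per_frame: int) -> int:
    """Token-pair count for one global attention pass over all frames."""
    total_tokens = num_frames * tokens_per_frame
    return total_tokens * total_tokens


def subscene_attention_pairs(
    num_frames: int, tokens_per_frame: int, num_subscenes: int
) -> int:
    """Token-pair count when attention runs separately inside each subscene.

    Assumption (illustrative only): one frame is the shared reference, the
    remaining frames are divided as evenly as possible, and the reference
    frame is added to every subscene.
    """
    if num_subscenes < 1:
        raise ValueError("num_subscenes must be at least 1")
    base, extra = divmod(num_frames - 1, num_subscenes)
    total = 0
    for k in range(num_subscenes):
        # +1 accounts for the shared reference frame in this subscene.
        size = base + (1 if k < extra else 0) + 1
        total += global_attention_pairs(size, tokens_per_frame)
    return total


full = global_attention_pairs(101, 10)
split = subscene_attention_pairs(101, 10, 4)
print(full, split, round(full / split, 2))
```

Under these assumptions, splitting 101 frames into 4 subscenes cuts the pair count by a factor a little below 4, because the shared reference frame is counted once per subscene.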

📝 Abstract
Feed-forward 3D foundation models face a key challenge: the quadratic computational cost introduced by global attention, which severely limits scalability as input length increases. Concurrent acceleration methods, such as token merging, operate at the token level. While they offer local savings, the required nearest-neighbor searches introduce undesirable overhead. Consequently, these techniques fail to tackle the fundamental issue of structural redundancy dominant in dense capture data. In this work, we introduce S-VGGT, a novel approach that addresses redundancy at the structural frame level, drastically shifting the optimization focus. We first leverage the initial features to build a dense scene graph, which characterizes structural scene redundancy and guides the subsequent scene partitioning. Using this graph, we softly assign frames to a small number of subscenes, guaranteeing balanced groups and smooth geometric transitions. The core innovation lies in designing the subscenes to share a common reference frame, establishing a parallel geometric bridge that enables independent and highly efficient processing without explicit geometric alignment. This structural reorganization provides strong intrinsic acceleration by cutting the global attention cost at its source. Crucially, S-VGGT is entirely orthogonal to token-level acceleration methods, allowing the two to be seamlessly combined for compounded speedups without compromising reconstruction fidelity. Code is available at https://github.com/Powertony102/S-VGGT.
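The pipeline described in the abstract (a dense graph over frames, soft assignment to a few subscenes, one shared reference frame) can be illustrated with a toy sketch. This is not the authors' algorithm or the code in their repository. It is a simplified stand-in that assumes one feature vector per frame, uses a cosine-similarity matrix as the "scene graph", picks subscene seeds by greedy farthest-point selection, and approximates soft assignment with a similarity margin. All function names, the `margin` parameter, and the seed heuristic are hypothetical, and the sketch does not enforce the balanced group sizes the abstract mentions.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def build_affinity(features):
    """Dense frame-to-frame similarity matrix, standing in for the scene graph."""
    return [[cosine(fi, fj) for fj in features] for fi in features]


def pick_seeds(affinity, num_subscenes, reference):
    """Greedy farthest-point selection: each new seed is the frame whose
    highest similarity to the already chosen frames is lowest."""
    chosen = [reference]
    candidates = [i for i in range(len(affinity)) if i != reference]
    for _ in range(num_subscenes):
        seed = min(candidates, key=lambda i: max(affinity[i][c] for c in chosen))
        chosen.append(seed)
        candidates.remove(seed)
    return chosen[1:]  # the reference frame itself is not a seed


def partition_frames(features, num_subscenes, reference=0, margin=0.1):
    """Return one frame-index list per subscene, each led by the reference.

    A frame joins its most similar seed and every other seed whose
    similarity is within `margin` of that best score (the soft part).
    """
    n = len(features)
    if not 1 <= num_subscenes <= n - 1:
        raise ValueError("num_subscenes must be between 1 and len(features) - 1")
    affinity = build_affinity(features)
    seeds = pick_seeds(affinity, num_subscenes, reference)
    members = [set() for _ in seeds]
    for frame in range(n):
        if frame == reference:
            continue
        sims = [affinity[frame][s] for s in seeds]
        best = max(sims)
        for k, sim in enumerate(sims):
            if sim >= best - margin:
                members[k].add(frame)
    # Every subscene shares the same reference frame as its first entry.
    return [[reference] + sorted(m) for m in members]


# Toy features: five frames sweeping from one viewing direction to another.
toy_features = [[1.0, 0.0], [1.0, 0.1], [0.7, 0.7], [0.1, 1.0], [0.0, 1.0]]
print(partition_frames(toy_features, num_subscenes=2))
```

Because each returned list starts with the same reference index, the subscenes could in principle be processed independently while sharing a common coordinate anchor, which is the role the abstract attributes to the shared reference frame.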
Problem

Research questions and friction points this paper is trying to address.

3D foundation models
structural redundancy
global attention
scalability
dense capture data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Structure-Aware Decomposition
Subscene Partitioning
3D Foundation Models
Global Attention Acceleration
Scene Graph
Xinze Li
Beijing Normal-Hong Kong Baptist University
Pengxu Chen
Jilin University
Yiyuan Wang
Hong Kong Baptist University
Weifeng Su
State University of New York (SUNY) at Buffalo
MIMO wireless communications, Cooperative Communications and Relaying, Space-Time Coding and Modulation, MIMO-OFDM Systems
Wentao Cheng
Beijing Normal-Hong Kong Baptist University