SRSplat: Feed-Forward Super-Resolution Gaussian Splatting from Sparse Multi-View Images

📅 2025-11-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing feed-forward methods for 3D reconstruction from sparse, low-resolution (LR) multi-view images struggle to recover fine-grained textures, limiting their applicability in autonomous driving and embodied intelligence. To address this, we propose a reference-guided high-resolution 3D reconstruction framework. First, we leverage multimodal large language models and diffusion models to synthesize high-fidelity cross-domain reference images and construct scene-specific reference libraries. Second, we design a reference-guided feature enhancement module and a texture-aware density control mechanism to align and fuse external reference features with internal texture cues. Finally, reconstruction is performed via Gaussian primitive decoding with adaptive density regulation. Our method achieves state-of-the-art performance on the RealEstate10K, ACID, and DTU benchmarks, demonstrating superior detail recovery and strong generalization across datasets and input resolutions.

📝 Abstract
Feed-forward 3D reconstruction from sparse, low-resolution (LR) images is a crucial capability for real-world applications, such as autonomous driving and embodied AI. However, existing methods often fail to recover fine texture details. This limitation stems from the inherent lack of high-frequency information in LR inputs. To address this, we propose SRSplat, a feed-forward framework that reconstructs high-resolution 3D scenes from only a few LR views. Our main insight is to compensate for the deficiency of texture information by jointly leveraging external high-quality reference images and internal texture cues. We first construct a scene-specific reference gallery, generated for each scene using Multimodal Large Language Models (MLLMs) and diffusion models. To integrate this external information, we introduce the Reference-Guided Feature Enhancement (RGFE) module, which aligns and fuses features from the LR input images and their reference twin images. Subsequently, we train a decoder to predict the Gaussian primitives using the multi-view fused features obtained from RGFE. To further refine the predicted Gaussian primitives, we introduce Texture-Aware Density Control (TADC), which adaptively adjusts Gaussian density based on the internal texture richness of the LR inputs. Extensive experiments demonstrate that SRSplat outperforms existing methods on various datasets, including RealEstate10K, ACID, and DTU, and exhibits strong cross-dataset and cross-resolution generalization capabilities.
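The RGFE module described above aligns LR features with their reference counterparts and fuses them. As a rough intuition for what "align and fuse" can mean, here is a minimal cross-attention sketch in NumPy: LR feature tokens query reference tokens, and the attended reference features are added back residually. The shapes, function names, and residual form are illustrative assumptions, not the paper's actual (learned) module:

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse_reference(lr_feat, ref_feat):
    """Cross-attention fusion sketch (hypothetical stand-in for RGFE):
    each LR token queries the reference tokens, and the attended
    reference features are added to the LR features residually."""
    d = lr_feat.shape[-1]
    attn = softmax(lr_feat @ ref_feat.T / np.sqrt(d))  # (n_lr, n_ref)
    return lr_feat + attn @ ref_feat

rng = np.random.default_rng(0)
lr_feat = rng.standard_normal((4, 8))   # 4 tokens from the LR views
ref_feat = rng.standard_normal((6, 8))  # 6 tokens from reference images
fused = fuse_reference(lr_feat, ref_feat)
```

The fused tokens keep the LR feature shape, so a downstream decoder (here, the Gaussian-primitive decoder) can consume them unchanged.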
Problem

Research questions and friction points this paper is trying to address.

Reconstructing high-resolution 3D scenes from sparse low-resolution images
Addressing the lack of high-frequency texture details in existing methods
Improving 3D reconstruction for autonomous driving and embodied AI applications
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generates scene-specific reference images using MLLMs
Aligns LR inputs with reference images via RGFE module
Adjusts Gaussian density adaptively with TADC mechanism
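The last bullet, texture-aware density control, can be sketched with a simple proxy: score each image patch by its mean gradient magnitude and allocate more Gaussian primitives to texture-rich patches. Everything here (the gradient-based score, the linear budget mapping, the `base`/`boost` parameters) is a hypothetical illustration of the idea, not the paper's TADC mechanism:

```python
import numpy as np

def texture_richness(img, patch=8):
    """Per-patch texture score: mean gradient magnitude over each patch
    (an assumed proxy for the paper's internal texture cue)."""
    gy, gx = np.gradient(img.astype(np.float64))
    mag = np.sqrt(gx**2 + gy**2)
    h, w = img.shape
    h2, w2 = h - h % patch, w - w % patch
    blocks = mag[:h2, :w2].reshape(h2 // patch, patch, w2 // patch, patch)
    return blocks.mean(axis=(1, 3))

def gaussian_budget(scores, base=4, boost=12):
    """Map normalized texture scores to a per-patch Gaussian count:
    flat patches get `base` primitives, textured ones up to base + boost."""
    norm = (scores - scores.min()) / (np.ptp(scores) + 1e-8)
    return (base + np.round(norm * boost)).astype(int)

# toy LR view: flat left half, noisy (texture-rich) right half
rng = np.random.default_rng(0)
img = np.zeros((32, 32))
img[:, 16:] = rng.random((32, 16))
budget = gaussian_budget(texture_richness(img))  # (4, 4) patch grid
```

In this toy example the noisy right-half patches receive a larger primitive budget than the flat left-half patches, mirroring the intent of adaptive density control.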
Xinyuan Hu
Undergrad at Emory University
AI · LLM
Changyue Shi
School of Computer Science and Technology, Hangzhou Dianzi University
Chuxiao Yang
School of Computer Science and Technology, Hangzhou Dianzi University
Minghao Chen
School of Computer Science and Technology, Hangzhou Dianzi University
Jiajun Ding
School of Computer Science and Technology, Hangzhou Dianzi University
Tao Wei
School of AI for Science, Peking University
Chen Wei
Li Auto Inc.
Zhou Yu
School of Computer Science and Technology, Hangzhou Dianzi University
Min Tan
Professor, School of Computer Science and Technology, Hangzhou Dianzi University
Machine Learning · Image Processing · Multimedia · Computer Vision