MVGGT: Multimodal Visual Geometry Grounded Transformer for Multiview 3D Referring Expression Segmentation

📅 2026-01-11
🏛️ arXiv.org
📈 Citations: 0 · Influential: 0
🤖 AI Summary
This work addresses the challenges of poor geometric reconstruction quality, coarse target localization, and high inference latency in 3D referring expression segmentation from sparse multi-view RGB images. To this end, we propose the first end-to-end multimodal Transformer framework that jointly performs 3D reconstruction and referring segmentation under sparse viewpoints through a dual-branch architecture that fuses linguistic and visual cues. We formally define the MV-3DRES task and introduce the MVRefer benchmark to facilitate research in this direction. To mitigate foreground gradient dilution during training, we present a Per-view No-target Suppression Optimization (PVSO) strategy, which significantly improves training stability and efficiency. Experiments show that our method establishes a strong baseline on MVRefer, achieving both high accuracy and fast inference and substantially outperforming existing approaches.
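The dual-branch architecture is only described at a high level here; below is a minimal sketch of how a language branch and a geometry-grounded visual branch might be fused with cross-attention. All module names, dimensions, and wiring are assumptions for illustration, not MVGGT's actual implementation.

```python
# Hypothetical sketch of a dual-branch fusion block: visual tokens from a
# geometry backbone attend to language tokens from a text encoder.
# Module names, dimensions, and wiring are assumptions, not the paper's code.
import torch
import torch.nn as nn

class LanguageVisualFusion(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, visual_tokens: torch.Tensor, text_tokens: torch.Tensor):
        # visual_tokens: (B, N_views * N_patches, dim), from the geometry branch
        # text_tokens:   (B, L, dim), from the language branch
        attended, _ = self.cross_attn(
            query=visual_tokens, key=text_tokens, value=text_tokens
        )
        fused = self.norm(visual_tokens + attended)   # residual + norm
        return fused + self.ffn(fused)                # language-conditioned tokens
```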

📝 Abstract
Most existing 3D referring expression segmentation (3DRES) methods rely on dense, high-quality point clouds, while real-world agents such as robots and mobile phones operate with only a few sparse RGB views and strict latency constraints. We introduce Multi-view 3D Referring Expression Segmentation (MV-3DRES), where the model must recover scene structure and segment the referred object directly from sparse multi-view images. Traditional two-stage pipelines, which first reconstruct a point cloud and then perform segmentation, often yield low-quality geometry, produce coarse or degraded target regions, and run slowly. We propose the Multimodal Visual Geometry Grounded Transformer (MVGGT), an efficient end-to-end framework that integrates language information into sparse-view geometric reasoning through a dual-branch design. Training in this setting exposes a critical optimization barrier, termed Foreground Gradient Dilution (FGD), where sparse 3D signals lead to weak supervision. To resolve this, we introduce Per-view No-target Suppression Optimization (PVSO), which provides stronger and more balanced gradients across views, enabling stable and efficient learning. To support consistent evaluation, we build MVRefer, a benchmark that defines standardized settings and metrics for MV-3DRES. Experiments show that MVGGT establishes the first strong baseline and achieves both high accuracy and fast inference, outperforming existing alternatives. Code and models are publicly available at https://mvggt.github.io.
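The abstract names PVSO without defining it. One plausible reading, sketched below, is a per-view loss with an explicit suppression term for views in which the referred object is absent, so that foreground gradients from the few target-bearing views are not diluted. The function name, inputs, and weighting are hypothetical, not the authors' formulation.

```python
# Hypothetical sketch of a per-view no-target suppression loss, assuming
# per-view mask logits and a flag marking views where the target is absent.
# This is a guess at the spirit of PVSO, not the paper's actual loss.
import torch
import torch.nn.functional as F

def pvso_loss(mask_logits, gt_masks, target_visible, no_target_weight=1.0):
    # mask_logits, gt_masks: (B, V, H, W); target_visible: (B, V) bool
    per_pixel = F.binary_cross_entropy_with_logits(
        mask_logits, gt_masks, reduction="none"
    )
    per_view = per_pixel.mean(dim=(-2, -1))  # (B, V): one loss value per view
    # Average foreground loss only over views that actually contain the target.
    fg_loss = (per_view * target_visible).sum() / target_visible.sum().clamp(min=1)
    # Suppress predictions on views without the referred object, giving those
    # views a dedicated gradient instead of letting them dilute the foreground.
    no_target = ~target_visible
    bg_loss = (per_view * no_target).sum() / no_target.sum().clamp(min=1)
    return fg_loss + no_target_weight * bg_loss
```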
Problem

Research questions and friction points this paper is trying to address.

3D referring expression segmentation
sparse multi-view images
scene reconstruction
multimodal understanding
real-world agents
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal Transformer
Sparse-view 3D Segmentation
Referring Expression Segmentation
Gradient Optimization
End-to-end 3D Vision
👥 Authors
Changli Wu (Xiamen University)
Haodong Wang (Xiamen University)
Jiayi Ji (Rutgers University)
Yutian Yao (Tianjin University of Science and Technology)
Chunsai Du (ByteDance)
Jihua Kang (ByteDance)
Yanwei Fu (Fudan University)
Liujuan Cao (Xiamen University)

Subjects: Computer vision · Machine learning · Multimedia