Multi-view Pyramid Transformer: Look Coarser to See Broader

📅 2025-12-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses large-scale 3D scene reconstruction from sparse image collections (tens to hundreds of views). We propose the Multi-view Pyramid Transformer (MVP), which jointly models local-to-global cross-view dependencies and fine-to-coarse intra-view structure via a two-level Transformer architecture, expanding the receptive field while preserving detail. MVP integrates pyramid-based feature aggregation with 3D Gaussian Splatting, allowing end-to-end reconstruction in a single forward pass and balancing computational efficiency, representational richness, and generalization. Extensive evaluation demonstrates state-of-the-art reconstruction quality across multiple benchmark datasets. Moreover, MVP is robust to varying numbers and configurations of input views and scales favorably to larger scenes and view counts.

📝 Abstract
We propose Multi-view Pyramid Transformer (MVP), a scalable multi-view transformer architecture that directly reconstructs large 3D scenes from tens to hundreds of images in a single forward pass. Drawing on the idea of "looking broader to see the whole, looking finer to see the details," MVP is built on two core design principles: 1) a local-to-global inter-view hierarchy that gradually broadens the model's perspective from local views to groups and ultimately the full scene, and 2) a fine-to-coarse intra-view hierarchy that starts from detailed spatial representations and progressively aggregates them into compact, information-dense tokens. This dual hierarchy achieves both computational efficiency and representational richness, enabling fast reconstruction of large and complex scenes. We validate MVP on diverse datasets and show that, when coupled with 3D Gaussian Splatting as the underlying 3D representation, it achieves state-of-the-art generalizable reconstruction quality while maintaining high efficiency and scalability across a wide range of view configurations.
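The dual hierarchy described above can be sketched in code. This is an illustrative toy sketch, not the paper's implementation: simple mean-pooling and concatenation stand in for the learned attention-based aggregation, and all function names and sizes are hypothetical. It only demonstrates the control flow of alternating fine-to-coarse intra-view pooling with local-to-global inter-view grouping.

```python
import numpy as np

def pool_tokens(tokens, factor=2):
    """Fine-to-coarse intra-view step: merge adjacent tokens by averaging
    (a stand-in for the learned aggregation into denser tokens)."""
    n, d = tokens.shape
    n_trim = (n // factor) * factor
    return tokens[:n_trim].reshape(n_trim // factor, factor, d).mean(axis=1)

def merge_groups(view_feats, group_size=2):
    """Local-to-global inter-view step: concatenate neighboring views into
    groups, widening the cross-view receptive field at each level."""
    groups = []
    for i in range(0, len(view_feats), group_size):
        groups.append(np.concatenate(view_feats[i:i + group_size], axis=0))
    return groups

def mvp_dual_hierarchy(views, levels=3):
    """Alternate the two hierarchies: each level makes per-view tokens
    coarser and the cross-view context broader."""
    feats = list(views)
    for _ in range(levels):
        feats = [pool_tokens(f) for f in feats]  # fine -> coarse
        if len(feats) > 1:
            feats = merge_groups(feats)          # local -> global
    return feats

# 8 views, each with 64 tokens of dimension 16
views = [np.random.randn(64, 16) for _ in range(8)]
out = mvp_dual_hierarchy(views)
```

After three levels the eight views collapse into a single scene-level group while each pooling step halves the token count, so compute per level stays roughly constant even as the receptive field grows.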
Problem

Research questions and friction points this paper is trying to address.

Reconstructing large 3D scenes efficiently from collections of many images
Balancing computational efficiency with detailed scene representation
Achieving scalable multi-view reconstruction in a single forward pass
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-view transformer reconstructs large 3D scenes from many images in one pass
Local-to-global inter-view and fine-to-coarse intra-view hierarchies balance breadth, detail, and efficiency
Couples the transformer with 3D Gaussian Splatting for high-quality, scalable reconstruction
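The coupling with 3D Gaussian Splatting implies a decoder head that maps the aggregated tokens to per-Gaussian parameters. Below is a minimal hypothetical sketch of such a head, a common pattern in feed-forward 3DGS methods; the paper's actual parameterization and activations may differ, and `tokens_to_gaussians` and the weight layout are assumptions for illustration.

```python
import numpy as np

# mean (3) + log-scale (3) + quaternion (4) + opacity (1) + RGB (3)
GAUSSIAN_DIM = 3 + 3 + 4 + 1 + 3

def tokens_to_gaussians(tokens, head_weight):
    """Hypothetical linear head mapping each aggregated token to the
    parameters of one 3D Gaussian."""
    raw = tokens @ head_weight                        # (n, GAUSSIAN_DIM)
    mean = raw[:, 0:3]                                # 3D position
    scale = np.exp(raw[:, 3:6])                       # positive scales
    quat = raw[:, 6:10]
    quat = quat / np.linalg.norm(quat, axis=1, keepdims=True)  # unit rotation
    opacity = 1.0 / (1.0 + np.exp(-raw[:, 10:11]))    # sigmoid to (0, 1)
    rgb = 1.0 / (1.0 + np.exp(-raw[:, 11:14]))        # sigmoid to (0, 1)
    return mean, scale, quat, opacity, rgb

# e.g. 64 scene-level tokens of dimension 16 from the pyramid
tokens = np.random.randn(64, 16)
W = np.random.randn(16, GAUSSIAN_DIM) * 0.1           # untrained weights
mean, scale, quat, opacity, rgb = tokens_to_gaussians(tokens, W)
```

Activations (exp for scales, normalization for quaternions, sigmoid for opacity and color) keep every predicted Gaussian in a valid range, which is what lets a single forward pass emit a renderable scene directly.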