AI Summary
This work addresses the failure of existing 3D reconstruction methods on reflective, transparent, and low-texture objects, which offer few reliable photometric and geometric cues. To tackle this challenge, the authors introduce the first large-scale hybrid dataset specifically designed for such difficult materials, comprising over 22 TB of data: more than 120,000 synthetic instances and over 1,000 real-world objects, totaling more than 7 million multi-view images. By integrating physically based rendering, diffusion-model-generated 3D shapes, and real data acquired with consumer-grade devices, the dataset achieves substantial diversity in both geometry and appearance. It supports five core benchmark tasks: image matching, structure-from-motion, novel view synthesis, reflection removal, and relighting. Experiments reveal a significant performance drop of state-of-the-art methods on this dataset, underscoring its value for advancing robust 3D vision models.
Abstract
Accurate 3D reconstruction of objects with reflective, transparent, or low-texture surfaces remains notoriously challenging. Such materials often violate key assumptions in multi-view reconstruction pipelines, such as photometric consistency and the availability of distinct geometric texture cues. Existing datasets primarily focus on diffuse, textured objects, and therefore provide limited insight into performance under real-world material complexities. We introduce 3DReflecNet, a large-scale hybrid dataset exceeding 22 TB that is specifically designed to benchmark and advance 3D vision methods for these challenging materials. 3DReflecNet combines two types of data: over 120,000 synthetic instances generated via physically based rendering of more than 12,000 shapes, and over 1,000 real-world objects captured using consumer devices. Together, these data comprise more than 7 million multi-view frames. The dataset spans diverse materials, complex lighting conditions, and a wide range of geometric forms, including shapes generated from both real and LLM-synthesized 2D images using diffusion-based pipelines. To support robust evaluation, we design benchmarks for five core tasks: image matching, structure-from-motion, novel view synthesis, reflection removal, and relighting. Extensive experiments demonstrate that state-of-the-art methods struggle to maintain accuracy across these settings, highlighting the need for more resilient 3D vision models.