🤖 AI Summary
Existing monocular multi-view 3D plane reconstruction methods adopt a fragmented, multi-module paradigm, leading to poor inter-task coordination and suboptimal performance. This paper proposes the first end-to-end, single-stage unified framework that jointly optimizes plane detection, segmentation, parameter regression, inter-frame association, and 6-DoF camera pose estimation. Our core innovation is a learnable plane-query-based Transformer architecture that eliminates reliance on initial pose priors or manually annotated plane correspondences. We further introduce a multi-task joint loss and self-supervised cross-view consistency modeling to enforce geometric coherence across views. Extensive experiments on ScanNetv1/v2, NYUv2-Plane, and Matterport3D demonstrate consistent and significant improvements over state-of-the-art methods across all sub-tasks, with strong positive synergistic effects observed between modules.
📝 Abstract
3D plane reconstruction from images can usually be divided into several per-frame sub-tasks (plane detection, segmentation, parameter regression, and possibly depth prediction), along with plane correspondence and relative camera pose estimation between frames. Previous works tend to divide and conquer these sub-tasks with distinct network modules, organized overall in a two-stage paradigm. Given an initial camera pose and per-frame plane predictions from the first stage, dedicated modules, potentially relying on extra plane correspondence labelling, are applied to merge multi-view plane entities and produce a 6DoF camera pose. Because no existing work integrates these closely related sub-tasks into a unified framework, instead treating them separately and sequentially, we suspect this is a main source of the performance limitations of existing approaches. Motivated by this observation and by the success of query-based learning in enriching reasoning among semantic entities, we propose PlaneRecTR++, a Transformer-based architecture that for the first time unifies all sub-tasks related to multi-view reconstruction and pose estimation in a compact single-stage model, without requiring initial pose estimation or plane correspondence supervision. Extensive quantitative and qualitative experiments demonstrate that the proposed unified learning yields mutual benefits across sub-tasks, achieving new state-of-the-art performance on the public ScanNetv1, ScanNetv2, NYUv2-Plane, and Matterport3D datasets.