AnyView: Generalizable Indoor 3D Object Detection with Variable Frames

📅 2023-10-09
🏛️ arXiv.org
📈 Citations: 3
Influential: 0
🤖 AI Summary
To address the poor generalization of existing 3D detectors under variable-length RGB-D input sequences in robot navigation, this paper proposes the first end-to-end indoor 3D object detection framework supporting arbitrary frame counts. Methodologically, the authors introduce a geometric learner and a spatial hybrid attention module to enable efficient interaction between local geometric features and global semantic features, and further propose a dynamic token sampling strategy that adaptively adjusts per-frame feature density to ensure a consistent global feature distribution after multi-frame fusion. Evaluated on ScanNet, the method achieves state-of-the-art detection accuracy while maintaining stable performance across varying input lengths (1-8 frames), with a parameter count comparable to baselines and a lightweight, efficient architecture. The core contribution is eliminating the fixed-frame constraint, enabling a single model, for the first time, to robustly process variable-length RGB-D sequences.
📝 Abstract
In this paper, we propose a novel network framework for indoor 3D object detection that handles variable numbers of input frames in practical scenarios. Existing methods consider only a fixed number of input frames for a single detector, such as monocular RGB-D images or point clouds reconstructed from dense multi-view RGB-D images. In practical application scenarios such as robot navigation and manipulation, however, the raw input to a 3D detector is a sequence of RGB-D images with a variable number of frames rather than a reconstructed scene point cloud, and previous approaches that can only handle fixed-frame input perform poorly in this setting. To make 3D object detection suitable for such practical tasks, we present AnyView, a novel 3D detection framework that generalizes well across different numbers of input frames with a single model. Specifically, we propose a geometric learner to mine the local geometric features of each input RGB-D frame and implement local-global feature interaction through a designed spatial mixture module. We further employ a dynamic token strategy to adaptively adjust the number of features extracted from each frame, which ensures a consistent global feature density and further enhances generalization after fusion. Extensive experiments on the ScanNet dataset show that our method achieves both strong generalizability and high detection accuracy with a simple, clean architecture containing a similar number of parameters to the baselines.
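The dynamic token idea described above (keeping the total token count, and hence the global feature density, constant regardless of how many frames arrive) can be sketched roughly as follows. This is an illustrative simplification, not the paper's actual algorithm: the per-frame quota, the random subsampling, and the function name are assumptions for exposition, whereas AnyView adaptively scores and selects tokens.

```python
import numpy as np

def dynamic_token_sampling(frame_features, global_budget=1024, rng=None):
    """Illustrative sketch (not the paper's exact method): split a fixed
    global token budget evenly across however many frames arrive, so the
    fused feature set has a consistent size whether we get 1 or 8 frames.

    frame_features: list of (n_i, d) arrays, one per RGB-D frame.
    Returns a (<= global_budget, d) array of fused tokens.
    """
    rng = rng or np.random.default_rng(0)
    per_frame = global_budget // len(frame_features)  # even per-frame quota
    sampled = []
    for feats in frame_features:
        k = min(per_frame, feats.shape[0])
        # Random subset here for simplicity; the paper selects tokens
        # adaptively rather than uniformly at random.
        idx = rng.choice(feats.shape[0], size=k, replace=False)
        sampled.append(feats[idx])
    return np.concatenate(sampled, axis=0)
```

With a budget of 1024 tokens, a single dense frame and four sparser frames both fuse into the same-sized global feature set, which is what keeps the downstream detector's input distribution stable across frame counts.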
Problem

Research questions and friction points this paper is trying to address.

Existing indoor 3D detectors assume a fixed number of input frames
Detection performance degrades under variable RGB-D frame counts in practical scenarios
Global feature density becomes inconsistent across different numbers of input frames
Innovation

Methods, ideas, or system contributions that make the work stand out.

Single end-to-end model that handles variable input frame numbers
Geometric learner that mines local geometric features per frame
Dynamic token strategy that adaptively adjusts per-frame feature counts
Zhenyu Wu
School of Automation, Beijing University of Posts and Telecommunications, Beijing, 100876, China
Xiuwei Xu
Tsinghua University
computer vision, embodied AI
Ziwei Wang
Department of Automation, Tsinghua University, and Beijing National Research Center for Information Science and Technology (BNRist), Beijing, 100084, China
Chong Xia
Department of Automation, Tsinghua University, and Beijing National Research Center for Information Science and Technology (BNRist), Beijing, 100084, China
Linqing Zhao
Postdoc, Tsinghua University
Computer Vision, Scene Understanding
Jiwen Lu
Department of Automation, Tsinghua University, and Beijing National Research Center for Information Science and Technology (BNRist), Beijing, 100084, China
Haibin Yan
Beijing University of Posts and Telecommunications
Computer Vision, Pattern Recognition, Robotics