🤖 AI Summary
This work proposes a novel approach to LiDAR-based 3D object detection by formulating it as an autoregressive sequence generation task, eliminating the need for hand-crafted anchor assignments and non-maximum suppression (NMS). The method decodes object parameters—including center location, size, orientation, velocity, and class—as discrete tokens in a causal, near-to-far order directly from point cloud features. By discarding anchors and NMS, the framework enables fully end-to-end training and seamlessly integrates with diverse point cloud backbone architectures. Furthermore, it opens new avenues for leveraging language modeling techniques—such as GRPO reinforcement learning—to optimize perceptual objectives. Evaluated on the nuScenes benchmark, the proposed approach achieves performance comparable to state-of-the-art methods, demonstrating the feasibility of anchor-free, NMS-free 3D detection.
📝 Abstract
LiDAR-based 3D object detectors typically rely on proposal heads with hand-crafted components like anchor assignment and non-maximum suppression (NMS), complicating training and limiting extensibility. We present AutoReg3D, an autoregressive 3D detector that casts detection as sequence generation. Given point-cloud features, AutoReg3D emits objects in a range-causal (near-to-far) order and encodes each object as a short, discrete-token sequence consisting of its center, size, orientation, velocity, and class. This near-to-far ordering mirrors LiDAR geometry--near objects occlude far ones but not vice versa--enabling straightforward teacher forcing during training and autoregressive decoding at test time. AutoReg3D is compatible with diverse point-cloud backbones and attains competitive nuScenes performance without anchors or NMS. Beyond parity, the sequential formulation unlocks language-model advances for 3D perception, including GRPO-style reinforcement learning for task-aligned objectives. These results position autoregressive decoding as a viable, flexible alternative for LiDAR-based detection and open a path to importing modern sequence-modeling tools into 3D perception.
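To make the tokenization concrete, here is a minimal sketch of how one might discretize object parameters and assemble a near-to-far target sequence for teacher forcing. The bin counts, value ranges, field order, and vocabulary layout below are illustrative assumptions, not the paper's actual design.

```python
# Illustrative sketch only: encode each 3D object (center, size, yaw,
# velocity, class) as discrete tokens, ordered near-to-far by range.
# All bin edges and the vocabulary layout are assumed, not from the paper.
import math

NUM_BINS = 256                 # assumed per-field quantization resolution
COORD_RANGE = (-54.0, 54.0)    # assumed nuScenes-style detection range (m)

def quantize(value, lo, hi, num_bins=NUM_BINS):
    """Map a continuous value in [lo, hi] to a discrete token id."""
    t = (value - lo) / (hi - lo)
    return min(num_bins - 1, max(0, int(t * num_bins)))

def object_to_tokens(obj):
    """Encode one object's parameters as a short token sequence."""
    lo, hi = COORD_RANGE
    return [
        quantize(obj["x"], lo, hi),
        quantize(obj["y"], lo, hi),
        quantize(obj["z"], -5.0, 3.0),           # assumed height range
        quantize(obj["w"], 0.0, 15.0),           # assumed size ranges
        quantize(obj["l"], 0.0, 15.0),
        quantize(obj["h"], 0.0, 8.0),
        quantize(obj["yaw"], -math.pi, math.pi),
        quantize(obj["vx"], -20.0, 20.0),        # assumed velocity range
        quantize(obj["vy"], -20.0, 20.0),
        NUM_BINS + obj["cls"],                   # class ids follow coord vocab
    ]

def build_target_sequence(objects):
    """Sort objects near-to-far by BEV range, then flatten their tokens.

    At training time this sequence is the teacher-forcing target; at test
    time the model decodes tokens in the same order autoregressively.
    """
    ordered = sorted(objects, key=lambda o: math.hypot(o["x"], o["y"]))
    seq = []
    for obj in ordered:
        seq.extend(object_to_tokens(obj))
    return seq
```

Because nearer objects can occlude farther ones but not the reverse, sorting by range gives every token a causal prefix that is geometrically "available" when it is predicted.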