🤖 AI Summary
LiDAR point cloud perception in complex urban scenes suffers from low accuracy, and 3D object detection and instance segmentation are typically addressed separately, hindering holistic scene understanding. Method: We propose a collaborative window processing mechanism and a variational mask-based instance segmentation framework. Our approach employs a bird’s-eye-view representation, a hierarchical window encoder-decoder architecture, and a query-based decoding head, integrating parallel window feature extraction with mask-driven fine-grained instance prediction to jointly optimize 3D detection and instance segmentation within a unified paradigm. Contribution/Results: To our knowledge, this is the first method enabling context-consistent, interpretable, and data-self-consistent multi-task joint inference—overcoming the limitations of conventional single-task regression. On nuScenes and other benchmarks, it achieves significant improvements: +2.3% mAP for 3D detection and +4.1% PQ for segmentation consistency, thereby enhancing robustness in scene understanding and decision-making reliability for autonomous driving systems.
📝 Abstract
Accurate perception and scene understanding in complex urban environments is a critical challenge for ensuring safe and efficient autonomous navigation. In this paper, we present Co-Win, a novel bird's eye view (BEV) perception framework that integrates point cloud encoding with efficient parallel window-based feature extraction to address the multi-modality inherent in environmental understanding. Our method employs a hierarchical architecture comprising a specialized encoder, a window-based backbone, and a query-based decoder head to effectively capture diverse spatial features and object relationships. Unlike prior approaches that treat perception as a simple regression task, our framework incorporates a variational approach with mask-based instance segmentation, enabling fine-grained scene decomposition and understanding. The Co-Win architecture processes point cloud data through progressive feature extraction stages, ensuring that predicted masks are both data-consistent and contextually relevant. Furthermore, our method produces interpretable and diverse instance predictions, enabling enhanced downstream decision-making and planning in autonomous driving systems.