Co-Win: Joint Object Detection and Instance Segmentation in LiDAR Point Clouds via Collaborative Window Processing

📅 2025-07-25
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
LiDAR point cloud perception in complex urban scenes suffers from low accuracy, and 3D object detection and instance segmentation are typically addressed separately, hindering holistic scene understanding. Method: We propose a collaborative window processing mechanism and a variational mask-based instance segmentation framework. Our approach employs a bird’s-eye-view representation, a hierarchical window encoder-decoder architecture, and a query-based decoding head, integrating parallel window feature extraction with mask-driven fine-grained instance prediction to jointly optimize 3D detection and instance segmentation within a unified paradigm. Contribution/Results: To our knowledge, this is the first method enabling context-consistent, interpretable, and data-self-consistent multi-task joint inference—overcoming the limitations of conventional single-task regression. On nuScenes and other benchmarks, it achieves significant improvements: +2.3% mAP for 3D detection and +4.1% PQ for segmentation consistency, thereby enhancing robustness in scene understanding and decision-making reliability for autonomous driving systems.

Technology Category

Application Category

📝 Abstract
Accurate perception and scene understanding in complex urban environments is a critical challenge for ensuring safe and efficient autonomous navigation. In this paper, we present Co-Win, a novel bird's eye view (BEV) perception framework that integrates point cloud encoding with efficient parallel window-based feature extraction to address the multi-modality inherent in environmental understanding. Our method employs a hierarchical architecture comprising a specialized encoder, a window-based backbone, and a query-based decoder head to effectively capture diverse spatial features and object relationships. Unlike prior approaches that treat perception as a simple regression task, our framework incorporates a variational approach with mask-based instance segmentation, enabling fine-grained scene decomposition and understanding. The Co-Win architecture processes point cloud data through progressive feature extraction stages, ensuring that predicted masks are both data-consistent and contextually relevant. Furthermore, our method produces interpretable and diverse instance predictions, enabling enhanced downstream decision-making and planning in autonomous driving systems.
Problem

Research questions and friction points this paper is trying to address.

Joint object detection and segmentation in LiDAR point clouds
Efficient feature extraction for urban scene understanding
Interpretable instance predictions for autonomous driving
Innovation

Methods, ideas, or system contributions that make the work stand out.

BEV perception framework with parallel window processing
Hierarchical architecture for spatial feature extraction
Variational approach with mask-based instance segmentation
🔎 Similar Papers
No similar papers found.