DINO-CoDT: Multi-class Collaborative Detection and Tracking with Vision Foundation Models

📅 2025-06-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing collaborative perception methods predominantly assume a single-class (e.g., vehicle-only) scenario, failing to address the substantial appearance and motion-pattern disparities among heterogeneous road users such as pedestrians and cyclists. This work introduces the first integrated multi-class 3D detection and tracking framework for vehicle-infrastructure cooperative perception. Its core contributions are: (1) a global spatial attention fusion (GSAF) module that enhances multi-scale feature learning for objects of varying sizes; (2) a tracklet RE-IDentification (REID) module that leverages the DINOv2 vision foundation model to improve identity consistency; and (3) a velocity-based adaptive tracklet management (VATM) module that mitigates ID switches and small-object drift. Evaluated on the V2X-Real and OPV2V benchmarks, the method achieves state-of-the-art performance in both detection and tracking accuracy, particularly excelling in small-object ID stability and cross-view robustness.

📝 Abstract
Collaborative perception plays a crucial role in enhancing environmental understanding by expanding the perceptual range and improving robustness against sensor failures, and primarily involves collaborative 3D detection and tracking tasks. The former focuses on object recognition in individual frames, while the latter captures continuous instance tracklets over time. However, existing works in both areas predominantly focus on the vehicle superclass, lacking effective solutions for multi-class collaborative detection and tracking. This limitation hinders their applicability in real-world scenarios, which involve diverse object classes with varying appearances and motion patterns. To overcome this limitation, we propose a multi-class collaborative detection and tracking framework tailored for diverse road users. We first present a detector with a global spatial attention fusion (GSAF) module, enhancing multi-scale feature learning for objects of varying sizes. Next, we introduce a tracklet RE-IDentification (REID) module that leverages visual semantics from a vision foundation model to effectively reduce ID SWitch (IDSW) errors caused by erroneous mismatches, particularly for small objects such as pedestrians. We further design a velocity-based adaptive tracklet management (VATM) module that adjusts the tracking interval dynamically based on object motion. Extensive experiments on the V2X-Real and OPV2V datasets show that our approach significantly outperforms existing state-of-the-art methods in both detection and tracking accuracy.
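To picture the REID step described above, suppose a per-object appearance embedding has already been extracted (e.g., by pooling DINOv2 patch features over each object crop). Associating existing tracklets with new detections then reduces to a similarity search. The sketch below uses greedy cosine-similarity matching with a fixed threshold; the function names, the greedy strategy, and the threshold are all illustrative assumptions, not the paper's actual matching rule:

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between two embedding vectors (epsilon avoids /0).
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def reid_match(tracklet_embs, det_embs, sim_thresh=0.6):
    """Greedily match tracklet embeddings to detection embeddings.

    Illustrative only: sim_thresh and the greedy order are hypothetical.
    Returns a list of (tracklet_idx, detection_idx) pairs; detections whose
    best similarity falls below the threshold stay unmatched, which is where
    a semantic embedding helps avoid spurious ID switches on small objects.
    """
    matches, used = [], set()
    for t, t_emb in enumerate(tracklet_embs):
        sims = [cosine_sim(t_emb, d) if j not in used else -1.0
                for j, d in enumerate(det_embs)]
        if not sims:
            break
        j = int(np.argmax(sims))
        if sims[j] >= sim_thresh:
            matches.append((t, j))
            used.add(j)
    return matches
```

In a full tracker this appearance score would typically be combined with a motion cue (e.g., IoU or Mahalanobis distance) rather than used alone.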
Problem

Research questions and friction points this paper is trying to address.

Lack of multi-class solutions for collaborative detection and tracking
Challenges in tracking diverse objects with varying appearances and motions
High ID switch errors for small objects like pedestrians
Innovation

Methods, ideas, or system contributions that make the work stand out.

Global spatial attention fusion for multi-scale detection
Vision foundation model for tracklet RE-IDentification
Velocity-based adaptive tracklet management for dynamic intervals
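The velocity-based tracklet management idea can be sketched as a survival-window schedule that depends on object speed. This is a minimal illustration under one plausible policy (slow objects such as pedestrians keep a longer survival window because their motion-model prediction drifts little per frame, while fast movers are pruned sooner); the function name, speed thresholds, and interpolation rule are all assumptions, not the paper's exact VATM design:

```python
def adaptive_max_age(speed, slow=1.0, fast=8.0, age_slow=10, age_fast=3):
    """Allowed consecutive missed frames before a tracklet is dropped.

    Hypothetical policy: linearly interpolate the survival window between
    age_slow (near-static objects, e.g. pedestrians) and age_fast (fast
    vehicles). Speeds are in m/s; all defaults are illustrative.
    """
    if speed <= slow:
        return age_slow
    if speed >= fast:
        return age_fast
    frac = (speed - slow) / (fast - slow)  # 0 at `slow`, 1 at `fast`
    return round(age_slow + frac * (age_fast - age_slow))
```

A tracker would call this each frame with the tracklet's estimated speed and drop the tracklet once its missed-frame count exceeds the returned window.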
Xunjie He
School of Automation, Beijing Institute of Technology, Beijing, 100081, China
Christina Dao Wen Lee
Advanced Robotics Centre, Mechanical Engineering, National University of Singapore, Singapore
Meiling Wang
School of Automation, Beijing Institute of Technology, Beijing, 100081, China
Chengran Yuan
Advanced Robotics Centre, Mechanical Engineering, National University of Singapore, Singapore
Zefan Huang
National University of Singapore
Yufeng Yue
School of Automation, Beijing Institute of Technology, Beijing, 100081, China
Marcelo H. Ang
Advanced Robotics Centre, Mechanical Engineering, National University of Singapore, Singapore