🤖 AI Summary
To address the limitations of Transformer-based detectors in traffic scenarios—namely, inaccurate localization under occlusion and for small objects, as well as computational redundancy—this paper proposes a dual-stream attention mechanism with multimodal queries. Methodologically, it introduces three heterogeneous query types: vision-language appearance queries, polygonal position queries, and learnable random queries; constructs a dual-stream cross-attention module to separately align semantic and spatial features; and incorporates a sparse attention strategy to enhance efficiency. The key innovation lies in the first coupling of multimodal queries with a dual-stream architecture, enabling query-adaptive selection and disentangled feature modeling. Evaluated on four benchmarks—including BDD100K and TT100K—the method achieves state-of-the-art performance, with significant improvements in average precision (AP) and recall, demonstrating effective joint optimization of detection accuracy and computational efficiency.
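The dual-stream idea described above can be illustrated with a minimal sketch: two parallel cross-attention streams refine semantic and spatial features separately before fusing, and the decoder queries are a concatenation of the three heterogeneous query types. This is an assumption-based illustration, not the authors' implementation; all class and variable names (`DualStreamCrossAttention`, `appearance_q`, `position_q`, `random_q`) are hypothetical.

```python
import torch
import torch.nn as nn

class DualStreamCrossAttention(nn.Module):
    """Illustrative sketch: one attention stream aligns queries with
    semantic (appearance) features, the other with spatial features;
    the two refined streams are then fused by a linear projection."""
    def __init__(self, d_model=256, n_heads=8):
        super().__init__()
        self.semantic_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.spatial_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.fuse = nn.Linear(2 * d_model, d_model)

    def forward(self, queries, semantic_feats, spatial_feats):
        sem, _ = self.semantic_attn(queries, semantic_feats, semantic_feats)
        spa, _ = self.spatial_attn(queries, spatial_feats, spatial_feats)
        return self.fuse(torch.cat([sem, spa], dim=-1))

d = 256
# Three heterogeneous query sets (placeholders for the paper's
# vision-language appearance, polygonal position, and learnable queries).
appearance_q = torch.randn(2, 10, d)  # e.g. from a vision-language model
position_q   = torch.randn(2, 10, d)  # e.g. polygonal position embeddings
random_q     = torch.randn(2, 10, d)  # learnable random queries
queries = torch.cat([appearance_q, position_q, random_q], dim=1)

block = DualStreamCrossAttention(d_model=d)
out = block(queries, torch.randn(2, 100, d), torch.randn(2, 100, d))
print(out.shape)  # torch.Size([2, 30, 256])
```

Separating the two streams lets semantic alignment and spatial localization be modeled in a disentangled way, which is the property the paper credits for improved localization in cluttered scenes.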
📝 Abstract
Transformer-based object detectors often struggle with occlusions, fine-grained localization, and computational inefficiency caused by fixed queries and dense attention. We propose DAMM (Dual-stream Attention with Multi-Modal queries), a novel framework introducing both query adaptation and structured cross-attention for improved accuracy and efficiency. DAMM capitalizes on three types of queries: appearance-based queries from vision-language models, positional queries using polygonal embeddings, and random learned queries for general scene coverage. Furthermore, a dual-stream cross-attention module separately refines semantic and spatial features, boosting localization precision in cluttered scenes. We evaluated DAMM on four challenging benchmarks, where it achieved state-of-the-art performance in average precision (AP) and recall, demonstrating the effectiveness of multi-modal query adaptation and dual-stream attention. Source code is available at: https://github.com/DET-LIP/DAMM