Learning Where to Embed: Noise-Aware Positional Embedding for Query Retrieval in Small-Object Detection

📅 2026-04-16
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses three challenges that Transformer-based detectors face in small-object detection: susceptibility to background noise, low-quality object queries, and decoder redundancy. The proposed HELP framework introduces a heatmap-guided positional embedding (HPE) mechanism that fuses positional and semantic information in a noise-aware way, preserving salient foreground structures while suppressing background interference. Gradient-based mask filtering, combined with Linear-Snake Convolution, improves query quality. The gradient supervision is applied only during training, so it adds no gradient computation at inference. HELP reduces the decoder from eight layers to three and cuts model parameters by 59.4% (from 163M to 66.3M), while reporting consistent accuracy gains at lower computational cost across multiple benchmarks. It also provides a heatbar visualization of HPE for interpretable diagnosis.
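To make the idea of noise-aware positional-semantic fusion concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: it assumes the simplest possible gating, in which a foreground heatmap with values in [0, 1] scales a standard sine/cosine positional embedding before it is added to the features. The function names `position_embedding` and `heatmap_gated_embedding` are hypothetical and chosen only for illustration.

```python
import math


def position_embedding(row, col, dim):
    """Sine/cosine embedding of one (row, col) location.

    dim must be divisible by 4: half of the channels encode the row,
    the other half encode the column.
    """
    quarter = dim // 4
    freqs = [1.0 / (10000 ** (i / quarter)) for i in range(quarter)]
    emb = []
    for coord in (row, col):
        emb += [math.sin(coord * f) for f in freqs]
        emb += [math.cos(coord * f) for f in freqs]
    return emb


def heatmap_gated_embedding(features, heatmap):
    """Add positional embeddings scaled by a foreground heatmap.

    features: H x W x D nested lists of floats.
    heatmap:  H x W nested lists with values in [0, 1].
    Assumption (not from the paper): the gate is a plain multiplication,
    so background cells (heatmap near 0) receive almost no positional signal.
    """
    out = []
    for r, row in enumerate(features):
        out_row = []
        for c, feat in enumerate(row):
            pe = position_embedding(r, c, len(feat))
            gate = heatmap[r][c]
            out_row.append([f + gate * p for f, p in zip(feat, pe)])
        out.append(out_row)
    return out
```

With this gating, a cell whose heatmap value is 0 keeps its original feature vector unchanged, while a cell whose heatmap value is 1 receives the full positional embedding. The actual HPE formulation may differ. The summary states only that positional encodings are preserved in foreground-salient regions and suppressed in the background.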

📝 Abstract
Transformer-based detectors have advanced small-object detection, but they often remain inefficient and vulnerable to background-induced query noise, which motivates deep decoders to refine low-quality queries. We present HELP (Heatmap-guided Embedding Learning Paradigm), a noise-aware positional-semantic fusion framework that studies where to embed positional information by selectively preserving positional encodings in foreground-salient regions while suppressing background clutter. Within HELP, we introduce Heatmap-guided Positional Embedding (HPE) as the core embedding mechanism and visualize it with a heatbar for interpretable diagnosis and fine-tuning. HPE is integrated into both the encoder and decoder: it guides noise-suppressed feature encoding by injecting heatmap-aware positional encoding, and it enables high-quality query retrieval by filtering background-dominant embeddings via a gradient-based mask filter before decoding. To address feature sparsity in complex small targets, we integrate Linear-Snake Convolution to enrich retrieval-relevant representations. The gradient-based heatmap supervision is used during training only, incurring no additional gradient computation at inference. As a result, our design reduces decoder layers from eight to three and achieves a 59.4% parameter reduction (66.3M vs. 163M) while maintaining consistent accuracy gains under a reduced compute budget across benchmarks.

Code Repository: https://github.com/yidimopozhibai/Noise-Suppressed-Query-Retrieval
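
The query-retrieval step described in the abstract, filtering background-dominant embeddings before decoding, can be illustrated with a small sketch. It is an assumption-laden stand-in rather than the authors' gradient-based mask filter: it takes a per-embedding saliency score as given, discards embeddings below a threshold, and keeps at most a fixed number of the highest-scoring ones. The name `select_queries` and the parameters `threshold` and `max_queries` are hypothetical.

```python
def select_queries(embeddings, saliency, threshold=0.5, max_queries=300):
    """Keep the most salient embeddings as decoder queries.

    embeddings: list of candidate embeddings (any objects).
    saliency:   list of floats, one foreground score per embedding.
    Assumption (not from the paper): the filter is a simple threshold
    followed by a top-k cut, with ties broken by original index.
    Returns the selected embeddings and their original indices.
    """
    kept = [(score, idx) for idx, score in enumerate(saliency) if score >= threshold]
    # Highest saliency first; the index keeps the order deterministic on ties.
    kept.sort(key=lambda pair: (-pair[0], pair[1]))
    indices = [idx for _, idx in kept[:max_queries]]
    return [embeddings[idx] for idx in indices], indices
```

For example, calling `select_queries(["a", "b", "c", "d"], [0.9, 0.1, 0.6, 0.7], max_queries=2)` drops `"b"` as background and keeps the two highest-scoring candidates. Passing cleaner queries to the decoder in this way is the intuition behind the reported reduction from eight decoder layers to three.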
Problem

Research questions and friction points this paper is trying to address.

small-object detection
query retrieval
background noise
positional embedding
Transformer-based detectors
Innovation

Methods, ideas, or system contributions that make the work stand out.

Noise-aware positional embedding
Heatmap-guided embedding
Query retrieval
Small-object detection
Transformer decoder optimization