Learning Where to Embed: Noise-Aware Positional Embedding for Query Retrieval in Small-Object Detection

📅 2026-04-16

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses three weaknesses of Transformer-based detectors in small-object detection: susceptibility to background noise, low-quality object queries, and decoder redundancy. It proposes the HELP framework, built around a heatmap-guided positional embedding (HPE) mechanism that fuses positional and semantic information in a noise-aware way, preserving positional encodings in salient foreground regions while suppressing background interference. A gradient-based mask filter removes background-dominant embeddings before decoding to improve query quality, and Linear-Snake Convolution enriches the sparse features of complex small targets. The gradient-based heatmap supervision is applied only during training, so it adds no gradient computation at inference. The authors report that HELP reduces the decoder from eight layers to three and cuts model parameters by 59.4% (66.3M vs. 163M) while maintaining consistent accuracy gains under a reduced compute budget across benchmarks. HPE can also be visualized with a heatbar for interpretable diagnosis and fine-tuning.

📝 Abstract

Transformer-based detectors have advanced small-object detection, but they often remain inefficient and vulnerable to background-induced query noise, which motivates deep decoders to refine low-quality queries. We present HELP (Heatmap-guided Embedding Learning Paradigm), a noise-aware positional-semantic fusion framework that studies where to embed positional information by selectively preserving positional encodings in foreground-salient regions while suppressing background clutter. Within HELP, we introduce Heatmap-guided Positional Embedding (HPE) as the core embedding mechanism and visualize it with a heatbar for interpretable diagnosis and fine-tuning. HPE is integrated into both the encoder and decoder: it guides noise-suppressed feature encoding by injecting heatmap-aware positional encoding, and it enables high-quality query retrieval by filtering background-dominant embeddings via a gradient-based mask filter before decoding. To address feature sparsity in complex small targets, we integrate Linear-Snake Convolution to enrich retrieval-relevant representations. The gradient-based heatmap supervision is used during training only, incurring no additional gradient computation at inference. As a result, our design reduces decoder layers from eight to three and achieves a 59.4% parameter reduction (66.3M vs. 163M) while maintaining consistent accuracy gains under a reduced compute budget across benchmarks. Code Repository: https://github.com/yidimopozhibai/Noise-Suppressed-Query-Retrieval
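The two ideas in the abstract, heatmap-aware positional encoding and filtering of background-dominant embeddings before decoding, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a standard sinusoidal encoding, a simple multiplicative gate, a fixed threshold of 0.5, and a foreground heatmap that is given directly rather than derived from gradients. All function names here are hypothetical.

```python
import math

def sinusoidal_encoding(position, dim):
    """Standard 1-D sinusoidal positional encoding for a single token."""
    enc = []
    for i in range(dim):
        angle = position / (10000 ** (2 * (i // 2) / dim))
        enc.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return enc

def heatmap_gated_embedding(features, heatmap):
    """Add positional encoding scaled by each token's foreground score.

    Assumption: a multiplicative gate; the paper's fusion may differ.
    """
    dim = len(features[0])
    out = []
    for pos, (feat, score) in enumerate(zip(features, heatmap)):
        pe = sinusoidal_encoding(pos, dim)
        out.append([f + score * p for f, p in zip(feat, pe)])
    return out

def filter_queries(embeddings, heatmap, threshold=0.5):
    """Keep (index, embedding) pairs whose foreground score meets the threshold.

    Assumption: a hard threshold stands in for the gradient-based mask filter.
    """
    return [(i, e) for i, (e, s) in enumerate(zip(embeddings, heatmap))
            if s >= threshold]

# Four tokens with zero features, so the output shows only the gated encoding.
features = [[0.0, 0.0, 0.0, 0.0] for _ in range(4)]
heatmap = [0.9, 0.1, 0.0, 0.7]  # hypothetical foreground scores in [0, 1]

embedded = heatmap_gated_embedding(features, heatmap)
queries = filter_queries(embedded, heatmap)
kept_indices = [i for i, _ in queries]
```

In this toy setup, tokens with low foreground scores receive little or no positional signal and are dropped before decoding, which is the intuition behind passing fewer, cleaner queries to a shallower decoder.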

Problem

Research questions and friction points this paper is trying to address.

small-object detection

query retrieval

background noise

positional embedding

Transformer-based detectors

Innovation

Methods, ideas, or system contributions that make the work stand out.

Noise-aware positional embedding

Heatmap-guided embedding

Query retrieval