Heuristic-inspired Reasoning Priors Facilitate Data-Efficient Referring Object Detection

📅 2026-03-25

📈 Citations: 0

✨ Influential: 0

career value

190K/year

🤖 AI Summary

This work addresses the challenge of modeling spatial and semantic structures in referring expression object detection under scarce annotation conditions. To this end, we propose HeROD, a framework that injects lightweight, model-agnostic, and interpretable heuristic spatial-semantic reasoning priors into three key stages of a DETR-style detection pipeline: proposal ranking, prediction fusion, and Hungarian matching. Our approach is the first to explicitly incorporate such heuristic priors into this task and introduces De-ROD, a new low-data evaluation benchmark. Experimental results demonstrate that HeROD significantly outperforms strong existing baselines across few-shot and low-label settings on RefCOCO, RefCOCO+, and RefCOCOg.

Technology Category

Application Category

📝 Abstract

Most referring object detection (ROD) models, especially the modern grounding detectors, are designed for data-rich conditions, yet many practical deployments, such as robotics, augmented reality, and other specialized domains, would face severe label scarcity. In such regimes, end-to-end grounding detectors need to learn spatial and semantic structure from scratch, wasting precious samples. We ask a simple question: Can explicit reasoning priors help models learn more efficiently when data is scarce? To explore this, we first introduce a Data-efficient Referring Object Detection (De-ROD) task, which is a benchmark protocol for measuring ROD performance in low-data and few-shot settings. We then propose the HeROD (Heuristic-inspired ROD), a lightweight, model-agnostic framework that injects explicit, heuristic-inspired spatial and semantic reasoning priors, which are interpretable signals derived based on the referring phrase, into 3 stages of a modern DETR-style pipeline: proposal ranking, prediction fusion, and Hungarian matching. By biasing both training and inference toward plausible candidates, these priors promise to improve label efficiency and convergence performance. On RefCOCO, RefCOCO+, and RefCOCOg, HeROD consistently outperforms strong grounding baselines in scarce-label regimes. More broadly, our results suggest that integrating simple, interpretable reasoning priors provides a practical and extensible path toward better data-efficient vision-language understanding.

Problem

Research questions and friction points this paper is trying to address.

referring object detection

data efficiency

label scarcity

few-shot learning

vision-language understanding

Innovation

Methods, ideas, or system contributions that make the work stand out.

referring object detection

data efficiency

reasoning priors