InstructSAM: A Training-Free Framework for Instruction-Oriented Remote Sensing Object Recognition

📅 2025-05-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address language-guided object recognition in remote sensing imagery under open-vocabulary, open-ended, and open-subclass query settings, this paper introduces the EarthInstruct benchmark and a new Instruction-Oriented Object Counting, Detection, and Segmentation (InstructCDS) task suite. Methodologically, it proposes InstructSAM, a training-free framework that uses a large vision-language model for instruction understanding and object-count estimation, SAM2 for mask proposal, and a binary integer programming formulation that jointly models semantic matching and counting constraints, eliminating confidence thresholds and post-processing heuristics. The framework performs zero-shot counting, detection, and segmentation without fine-tuning. Experiments show performance competitive with or superior to task-specific baselines, with near-constant inference latency regardless of object count, an 89% reduction in output tokens, and over a 32% decrease in total runtime, significantly improving efficiency for large-scale remote sensing mapping and automated annotation.

📝 Abstract
Language-guided object recognition in remote sensing imagery is crucial for large-scale mapping and automated data annotation. However, existing open-vocabulary and visual grounding methods rely on explicit category cues, limiting their ability to handle complex or implicit queries that require advanced reasoning. To address this issue, we introduce a new suite of tasks, including Instruction-Oriented Object Counting, Detection, and Segmentation (InstructCDS), covering open-vocabulary, open-ended, and open-subclass scenarios. We further present EarthInstruct, the first InstructCDS benchmark for earth observation. It is constructed from two diverse remote sensing datasets with varying spatial resolutions and annotation rules across 20 categories, necessitating models to interpret dataset-specific instructions. Given the scarcity of semantically rich labeled data in remote sensing, we propose InstructSAM, a training-free framework for instruction-driven object recognition. InstructSAM leverages large vision-language models to interpret user instructions and estimate object counts, employs SAM2 for mask proposal, and formulates mask-label assignment as a binary integer programming problem. By integrating semantic similarity with counting constraints, InstructSAM efficiently assigns categories to predicted masks without relying on confidence thresholds. Experiments demonstrate that InstructSAM matches or surpasses specialized baselines across multiple tasks while maintaining near-constant inference time regardless of object count, reducing output tokens by 89% and overall runtime by over 32% compared to direct generation approaches. We believe the contributions of the proposed tasks, benchmark, and effective approach will advance future research in developing versatile object recognition systems.
Problem

Research questions and friction points this paper is trying to address.

Existing open-vocabulary and visual grounding methods rely on explicit category cues and struggle with complex or implicit queries
Semantically rich labeled data is scarce in remote sensing, making training-based instruction-following approaches costly
Assigning categories to mask proposals typically depends on confidence thresholds and post-processing heuristics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages vision-language models for instruction interpretation
Uses SAM2 for mask proposal generation
Formulates mask-label assignment as binary integer programming
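The innovations above can be illustrated with a toy version of the assignment step. The sketch below is an assumption-laden illustration, not the paper's implementation: `assign_masks`, the similarity matrix, and the counts are hypothetical, and an exhaustive search stands in for a real binary integer programming solver. It captures the stated idea: pick at most one label per mask so that each label is used exactly as many times as the LVLM-estimated count, maximizing total semantic similarity, with no confidence threshold.

```python
from itertools import product

def assign_masks(sim, counts):
    """Brute-force stand-in for the binary integer program:
    assign at most one label per mask so that label j is used
    exactly counts[j] times and total similarity is maximal."""
    n_masks, n_labels = len(sim), len(counts)
    best_choice, best_score = None, float("-inf")
    # Each mask gets a label index, or -1 for "unassigned".
    for choice in product(range(-1, n_labels), repeat=n_masks):
        used = [0] * n_labels
        score = 0.0
        for mask_i, label_j in enumerate(choice):
            if label_j >= 0:
                used[label_j] += 1
                score += sim[mask_i][label_j]
        # Counting constraint: usage must match the estimated counts.
        if used == list(counts) and score > best_score:
            best_choice, best_score = choice, score
    return best_choice, best_score

# Toy example: 3 mask proposals, 2 instruction-derived labels,
# and estimated counts of 2 and 1 respectively (all values invented).
sim = [[0.9, 0.1],   # mask 0 strongly matches label 0
       [0.2, 0.8],   # mask 1 strongly matches label 1
       [0.6, 0.3]]   # mask 2 weakly matches label 0
counts = (2, 1)
assignment, score = assign_masks(sim, counts)
print(assignment, round(score, 2))
```

A production system would hand the same objective and constraints to a proper BIP solver, which is what makes the runtime roughly independent of object count.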
Yijie Zheng
Msc. Student, University of Chinese Academy of Sciences
Remote Sensing · Deep Learning · Vision-Language Model · Wildlife Conservation · Orienteering Mapping
Weijie Wu
Roblox
Computer Networks
Qingyun Li
University of Electronic Science and Technology of China
Wireless Communications · Information Theory
Xuehui Wang
PhD Candidate, Shanghai Jiao Tong University
Computer Vision · Segmentation · Detection
Xu Zhou
University of Wollongong
Aiai Ren
University of Wollongong
Jun Shen
University of Wollongong
Long Zhao
Aerospace Information Research Institute
Guoqing Li
Aerospace Information Research Institute
Xue Yang
Shanghai Jiao Tong University