InstructSAM: A Training-Free Framework for Instruction-Oriented Remote Sensing Object Recognition

📅 2025-05-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address language-guided object recognition in remote sensing imagery under open-vocabulary, open-ended, and open-subclass query settings, this paper introduces the EarthInstruct benchmark and a new Instruction-Oriented Object Counting, Detection, and Segmentation (InstructCDS) task suite. Methodologically, it proposes InstructSAM, a training-free framework that uses a large vision-language model for instruction understanding and object-count estimation, SAM2 for mask proposal, and a binary integer programming formulation that jointly models semantic matching and counting constraints, eliminating confidence thresholds and post-processing heuristics. The framework performs zero-shot counting, detection, and segmentation without fine-tuning. Experiments show performance competitive with or superior to task-specific baselines, with near-constant inference latency regardless of object count, an 89% reduction in output tokens, and over a 32% decrease in total runtime, significantly improving efficiency for large-scale remote sensing mapping and automated annotation.

📝 Abstract
Language-guided object recognition in remote sensing imagery is crucial for large-scale mapping and automated data annotation. However, existing open-vocabulary and visual grounding methods rely on explicit category cues, limiting their ability to handle complex or implicit queries that require advanced reasoning. To address this issue, we introduce a new suite of tasks, including Instruction-Oriented Object Counting, Detection, and Segmentation (InstructCDS), covering open-vocabulary, open-ended, and open-subclass scenarios. We further present EarthInstruct, the first InstructCDS benchmark for earth observation. It is constructed from two diverse remote sensing datasets with varying spatial resolutions and annotation rules across 20 categories, necessitating models to interpret dataset-specific instructions. Given the scarcity of semantically rich labeled data in remote sensing, we propose InstructSAM, a training-free framework for instruction-driven object recognition. InstructSAM leverages large vision-language models to interpret user instructions and estimate object counts, employs SAM2 for mask proposal, and formulates mask-label assignment as a binary integer programming problem. By integrating semantic similarity with counting constraints, InstructSAM efficiently assigns categories to predicted masks without relying on confidence thresholds. Experiments demonstrate that InstructSAM matches or surpasses specialized baselines across multiple tasks while maintaining near-constant inference time regardless of object count, reducing output tokens by 89% and overall runtime by over 32% compared to direct generation approaches. We believe the contributions of the proposed tasks, benchmark, and effective approach will advance future research in developing versatile object recognition systems.
Problem

Research questions and friction points this paper is trying to address.

Existing open-vocabulary and visual grounding methods rely on explicit category cues and struggle with complex or implicit queries
Semantically rich labeled data is scarce in remote sensing, making training-based instruction-following approaches costly
Assigning categories to mask proposals typically depends on confidence thresholds and post-processing heuristics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages vision-language models for instruction interpretation
Uses SAM2 for mask proposal generation
Formulates mask-label assignment as binary integer programming
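The innovations above can be illustrated with a toy version of the assignment step. The sketch below is an assumption-laden illustration, not the paper's implementation: `assign_masks`, the similarity matrix, and the counts are hypothetical, and an exhaustive search stands in for a real binary integer programming solver. It captures the stated idea: pick at most one label per mask so that each label is used exactly as many times as the LVLM-estimated count, maximizing total semantic similarity, with no confidence threshold.

```python
from itertools import product

def assign_masks(sim, counts):
    """Brute-force stand-in for the binary integer program:
    assign at most one label per mask so that label j is used
    exactly counts[j] times and total similarity is maximal."""
    n_masks, n_labels = len(sim), len(counts)
    best_choice, best_score = None, float("-inf")
    # Each mask gets a label index, or -1 for "unassigned".
    for choice in product(range(-1, n_labels), repeat=n_masks):
        used = [0] * n_labels
        score = 0.0
        for mask_i, label_j in enumerate(choice):
            if label_j >= 0:
                used[label_j] += 1
                score += sim[mask_i][label_j]
        # Counting constraint: usage must match the estimated counts.
        if used == list(counts) and score > best_score:
            best_choice, best_score = choice, score
    return best_choice, best_score

# Toy example: 3 mask proposals, 2 instruction-derived labels,
# and estimated counts of 2 and 1 respectively (all values invented).
sim = [[0.9, 0.1],   # mask 0 strongly matches label 0
       [0.2, 0.8],   # mask 1 strongly matches label 1
       [0.6, 0.3]]   # mask 2 weakly matches label 0
counts = (2, 1)
assignment, score = assign_masks(sim, counts)
print(assignment, round(score, 2))
```

A production system would hand the same objective and constraints to a proper BIP solver, which is what makes the runtime roughly independent of object count.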
Yijie Zheng
Msc. Student, University of Chinese Academy of Sciences
Remote Sensing · Deep Learning · Vision-Language Model · Wildlife Conservation · Orienteering Mapping
Weijie Wu
Roblox
Computer Networks
Qingyun Li
University of Electronic Science and Technology of China
Wireless Communications · Information Theory
Xuehui Wang
PhD Candidate, Shanghai Jiao Tong University
Computer Vision · Segmentation · Detection
Xu Zhou
University of Wollongong
Aiai Ren
University of Wollongong
Jun Shen
University of Wollongong
Long Zhao
Aerospace Information Research Institute
Guoqing Li
Aerospace Information Research Institute
Xue Yang
Shanghai Jiao Tong University