InstructSAM: Segment Any Instance with Any Instructions

📅 2026-05-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of existing approaches in compositional reasoning, redundant prediction, and inference efficiency by proposing an instruction-driven multi-instance segmentation method. The approach formulates the task as a set-structured query prediction problem, using learnable instance queries to bridge a vision-language model with SAM3. By introducing an explicit reasoning interface and a hybrid attention mechanism, without modifying SAM3’s core architecture, the method gives SAM3 instruction comprehension, compositional reasoning, and set prediction capabilities. Using only a 2B-scale model, it outperforms current end-to-end methods and SAM3-based agent pipelines on benchmarks involving complex instructions and phrase-level referring expression segmentation, producing multi-instance segmentations in a single forward pass.
📝 Abstract
In this paper, we introduce InstructSAM, a unified and streamlined framework designed for multi-instance segmentation under arbitrary instructions. We formulate instruction-driven instance segmentation as a set-structured query prediction problem and propose an explicit reasoning-to-instance query interface that bridges a vision-language model (VLM) and SAM3. Specifically, a bank of learnable instance queries is injected into the VLM and contextualized with instruction and visual information, enabling each query to serve as an instance-aware slot. A hybrid-attention mechanism further promotes interaction among these queries, visual tokens, and instruction tokens, improving instance enumeration and reducing duplicate predictions. The resulting LLM-conditioned queries are projected into SAM3's detector query space to drive accurate multi-instance segmentation in a single forward pass. This design equips SAM3 with high-level instruction understanding, compositional reasoning, and instance-level set prediction without modifying its core architecture. To support training and evaluation, we further construct Inst2Seg, a high-quality and large-scale instruction-based instance segmentation dataset and benchmark that couples free-form instructions with instance-level masks. Extensive experiments show that InstructSAM, at only 2B scale, achieves strong results across complex instruction-driven and phrase-level referring segmentation benchmarks, outperforming prior end-to-end methods and SAM3's agentic pipeline while enabling efficient single-pass multi-instance prediction.
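The abstract does not spell out how the hybrid-attention mechanism is built. As a rough illustration only, the sketch below shows one plausible reading: visual and instruction tokens keep ordinary causal attention, while the appended instance queries attend to the full prefix and to each other in both directions. The token layout, the function name `hybrid_attention_mask`, and the masking rule are all assumptions made for this example, not details taken from the paper.

```python
def hybrid_attention_mask(num_prefix: int, num_queries: int) -> list[list[bool]]:
    """Build a boolean mask where mask[i][j] is True if position i may attend to j.

    Assumed (hypothetical) layout of the sequence:
        [prefix tokens: visual + instruction] + [learnable instance queries]
    """
    total = num_prefix + num_queries
    mask = [[False] * total for _ in range(total)]
    for i in range(total):
        for j in range(total):
            if i < num_prefix:
                # Prefix tokens: standard causal attention (current and earlier only).
                mask[i][j] = j <= i
            else:
                # Instance queries: see the whole prefix and every query,
                # including later ones, so slots can coordinate with each other.
                mask[i][j] = True
    return mask


if __name__ == "__main__":
    # Print a small mask: 3 prefix tokens followed by 2 instance queries.
    for row in hybrid_attention_mask(num_prefix=3, num_queries=2):
        print("".join("1" if allowed else "0" for allowed in row))
```

Under this assumed scheme, letting the queries see one another is what would allow two slots to avoid claiming the same instance, which is consistent with the stated goal of reducing duplicate predictions.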
Problem

Research questions and friction points this paper is trying to address.

instruction-driven segmentation
instance segmentation
multi-instance segmentation
vision-language model
referring segmentation
Innovation

Methods, ideas, or system contributions that make the work stand out.

instruction-driven segmentation
instance-aware queries
vision-language model
set prediction
hybrid attention