🤖 AI Summary
Existing open-vocabulary segmentation models (e.g., SAM3) support only noun-phrase-level concept segmentation and struggle to interpret complex natural-language instructions involving attributes, spatial relations, functional descriptions, or implicit reasoning. To address this, we propose SAM3-I, the first end-to-end instruction-aware image segmentation framework, which jointly performs vision-language representation alignment, cascaded instruction adaptation, and multimodal instruction encoding to directly parse diverse instructions, from simple nouns to compositional semantics. We introduce a structured instruction taxonomy and a scalable generative data engine, enabling fine-grained instruction-following segmentation without compromising the model's original open-vocabulary capability. Experiments demonstrate substantial improvements in accuracy and generalization on complex instruction-following segmentation tasks. The code and dataset are publicly released, supporting domain-specific fine-tuning.
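To make the three-level instruction taxonomy concrete, the sketch below shows what instruction-mask pairs at each level might look like. All field names and example instructions here are illustrative assumptions, not the released dataset schema.

```python
# Illustrative sketch of the three instruction levels; field names and
# example instructions are hypothetical, not the released SAM3-I schema.
instruction_examples = {
    # Concept level: short noun-phrase prompts, as in vanilla SAM3.
    "concept": "a dog",
    # Simple level: one attribute or spatial relation narrows the target.
    "simple": "the brown dog on the left",
    # Complex level: compositional semantics and implicit reasoning.
    "complex": "the animal most likely to fetch the ball being thrown",
}

# A training pair could bind one instruction to its target mask(s).
sample = {
    "image_path": "images/000123.jpg",        # placeholder path
    "level": "complex",
    "instruction": instruction_examples["complex"],
    "mask_rle": "<run-length-encoded mask>",  # placeholder encoding
}
print(sample["instruction"])
```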
📝 Abstract
Segment Anything Model 3 (SAM3) has advanced open-vocabulary segmentation through promptable concept segmentation, allowing users to segment all instances corresponding to a given concept, typically specified with short noun-phrase (NP) prompts. While this marks the first integration of language-level concepts within the SAM family, real-world usage typically requires far richer expressions that include attributes, spatial relations, functionalities, actions, states, and even implicit reasoning over instances. Currently, SAM3 relies on external multimodal agents to convert complex instructions into NPs and then perform iterative mask filtering. However, these NP-level concepts remain overly coarse, often failing to pick out a specific instance precisely. In this work, we present SAM3-I, an enhanced framework that unifies concept-level understanding and instruction-level reasoning within the SAM family. SAM3-I introduces an instruction-aware cascaded adaptation mechanism that progressively aligns expressive instruction semantics with SAM3's existing vision-language representations, enabling direct instruction-following segmentation without sacrificing its original concept-driven capabilities. Furthermore, we design a structured instruction taxonomy spanning concept, simple, and complex levels, and develop a scalable data engine to construct a dataset of diverse instruction-mask pairs. Experiments show that SAM3-I delivers strong performance, demonstrating that SAM3 can be effectively extended to follow natural-language instructions while preserving its strong concept grounding. We open-source SAM3-I and provide practical fine-tuning workflows, enabling researchers to adapt it to domain-specific applications. The source code is publicly available.
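As a rough illustration of the unified prompting described above, here is a minimal usage sketch. The `sam3i` package name and the `load_model`/`segment` calls are assumptions made purely for illustration; consult the released repository for the actual API.

```python
# Hedged usage sketch: `sam3i`, `load_model`, and `segment` are assumed
# names for illustration only, not the actual SAM3-I API.
from PIL import Image

import sam3i  # hypothetical package name

model = sam3i.load_model("sam3-i-base")  # hypothetical checkpoint id
image = Image.open("kitchen.jpg")

# Concept-level prompt (preserved SAM3 behavior): all matching instances.
concept_masks = model.segment(image, "mug")

# Instruction-level prompt (new in SAM3-I): an attribute and a spatial
# relation single out one specific instance.
instance_masks = model.segment(image, "the chipped mug behind the kettle")
```

The point of the sketch is the single entry point: the same call accepts both NP-level concepts and full instructions, which is the behavior the abstract attributes to the cascaded adaptation mechanism.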