SAM3-I: Segment Anything with Instructions

📅 2025-12-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing open-vocabulary segmentation models (e.g., SAM3) support only noun-phrase-level concept segmentation and struggle to interpret complex natural-language instructions involving attributes, spatial relations, functional descriptions, or implicit reasoning. To address this, we propose the first end-to-end instruction-aware image segmentation framework, which jointly performs vision-language representation alignment, cascaded instruction adaptation, and multimodal instruction encoding to directly parse instructions ranging from simple nouns to compositional semantics. We introduce a structured instruction taxonomy and a scalable generative data engine, enabling fine-grained instruction-following segmentation without compromising the original open-vocabulary segmentation capability. Experiments demonstrate substantial improvements in accuracy and generalization on complex instruction segmentation tasks. The code and dataset are publicly released, supporting domain-specific fine-tuning.

📝 Abstract
Segment Anything Model 3 (SAM3) has advanced open-vocabulary segmentation through promptable concept segmentation, allowing users to segment all instances corresponding to a given concept, typically specified with short noun-phrase (NP) prompts. While this marks the first integration of language-level concepts within the SAM family, real-world usage typically requires far richer expressions that include attributes, spatial relations, functionalities, actions, states, and even implicit reasoning over instances. Currently, SAM3 relies on external multi-modal agents to convert complex instructions into NPs and then conduct iterative mask filtering. However, these NP-level concepts remain overly coarse, often failing to precisely represent a specific instance. In this work, we present SAM3-I, an enhanced framework that unifies concept-level understanding and instruction-level reasoning within the SAM family. SAM3-I introduces an instruction-aware cascaded adaptation mechanism that progressively aligns expressive instruction semantics with SAM3's existing vision-language representations, enabling direct instruction-following segmentation without sacrificing its original concept-driven capabilities. Furthermore, we design a structured instruction taxonomy spanning concept, simple, and complex levels, and develop a scalable data engine to construct a dataset with diverse instruction-mask pairs. Experiments show that SAM3-I delivers appealing performance, demonstrating that SAM3 can be effectively extended to follow natural-language instructions while preserving its strong concept grounding. We open-source SAM3-I and provide practical fine-tuning workflows, enabling researchers to adapt it to domain-specific applications. The source code is available here.
Problem

Research questions and friction points this paper is trying to address.

Enhances SAM3 to handle complex instructions beyond simple noun phrases
Unifies concept-level understanding with instruction-level reasoning for segmentation
Enables direct instruction-following segmentation while preserving concept-driven capabilities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Instruction-aware cascaded adaptation aligns semantics with vision-language representations
Structured instruction taxonomy spans concept, simple, and complex levels
Scalable data engine constructs diverse instruction-mask pairs dataset
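The three-level instruction taxonomy (concept, simple, complex) can be illustrated with a toy classifier. The keyword lists, thresholds, and function name below are illustrative assumptions for demonstration only, not part of the SAM3-I release:

```python
# Illustrative sketch only: a toy heuristic that sorts instruction prompts into
# the paper's three taxonomy levels. Keyword lists are assumptions, not SAM3-I code.

RELATION_WORDS = {"left", "right", "above", "below", "behind", "next", "between"}
REASONING_WORDS = {"because", "used", "would", "should", "most", "if"}

def taxonomy_level(instruction: str) -> str:
    """Classify an instruction as 'concept', 'simple', or 'complex'."""
    tokens = [t.strip(",.") for t in instruction.lower().split()]
    if any(t in REASONING_WORDS for t in tokens):
        return "complex"   # functional descriptions or implicit reasoning
    if len(tokens) > 2 or any(t in RELATION_WORDS for t in tokens):
        return "simple"    # attributes or spatial relations
    return "concept"       # short noun phrase, SAM3-style prompt

print(taxonomy_level("dog"))                         # concept
print(taxonomy_level("red car on the left"))         # simple
print(taxonomy_level("the tool used to cut paper"))  # complex
```

In the paper, such levels structure the data engine's instruction-mask pairs; a production system would use a learned classifier rather than keyword heuristics.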
Authors
Jingjing Li (University of Alberta)
Yue Feng (NUAA)
Yuchen Guo (Tsinghua University)
Jincai Huang (SUSTech)
Yongri Piao (Dalian University of Technology)
Qi Bi (Utrecht University)
Miao Zhang (Dalian University of Technology)
Xiaoqi Zhao (Yale University)
Qiang Chen (Independent Researcher)
Shihao Zou (Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences)
Wei Ji (Yale University)
Huchuan Lu (Dalian University of Technology)
Li Cheng (University of Alberta)