SAM 3: Segment Anything with Concepts

📅 2025-11-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper introduces Promptable Concept Segmentation (PCS), a task that unifies object detection, segmentation, and tracking in images and videos via concept prompts, i.e., short noun phrases, image exemplars, or a combination of both. Methodologically, the authors (1) build a scalable data engine that produces a large-scale dataset with 4M unique concept labels, including hard negatives, and (2) design an image-level detector and a memory-based video tracker that share a single backbone, decoupling recognition from localization with a presence head. The contributions are threefold: first, the PCS task is formally defined together with SA-Co, a new benchmark for it; second, the model doubles the accuracy of prior state-of-the-art methods on both image and video PCS benchmarks; third, it improves on previous SAM capabilities across visual segmentation benchmarks. All models and benchmarks are publicly released.

📝 Abstract
We present Segment Anything Model (SAM) 3, a unified model that detects, segments, and tracks objects in images and videos based on concept prompts, which we define as either short noun phrases (e.g., "yellow school bus"), image exemplars, or a combination of both. Promptable Concept Segmentation (PCS) takes such prompts and returns segmentation masks and unique identities for all matching object instances. To advance PCS, we build a scalable data engine that produces a high-quality dataset with 4M unique concept labels, including hard negatives, across images and videos. Our model consists of an image-level detector and a memory-based video tracker that share a single backbone. Recognition and localization are decoupled with a presence head, which boosts detection accuracy. SAM 3 doubles the accuracy of existing systems in both image and video PCS, and improves previous SAM capabilities on visual segmentation tasks. We open source SAM 3 along with our new Segment Anything with Concepts (SA-Co) benchmark for promptable concept segmentation.
Problem

Research questions and friction points this paper is trying to address.

Detects, segments, and tracks objects using concept prompts
Creates a scalable dataset with 4M unique concept labels
Improves segmentation accuracy by decoupling recognition from localization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses concept prompts for object detection, segmentation, and tracking
Implements a scalable data engine producing 4M unique concept labels
Decouples recognition and localization with a presence head
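The presence-head idea in the bullets above can be sketched as follows: each detection query scores *where* an instance might be (localization), while a single image-level presence head scores *whether* the concept occurs at all (recognition), and the final score is their product, so hard-negative prompts are suppressed globally. The exact scoring rule and names below are illustrative assumptions, not the paper's implementation.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def detection_scores(loc_logits: list[float], presence_logit: float) -> list[float]:
    """Combine per-query localization scores with one global presence score.

    Without the presence head, each query must also encode "does this
    concept appear in the image at all?"; factoring that out lets the
    localization scores stay sharp while absent concepts are rejected
    by a single probability (a toy sketch of the decoupling)."""
    p_present = sigmoid(presence_logit)
    return [sigmoid(l) * p_present for l in loc_logits]

# Concept present: confident queries keep high scores.
print(detection_scores([4.0, -2.0], presence_logit=5.0))
# Hard negative (concept absent): the same queries are damped globally.
print(detection_scores([4.0, -2.0], presence_logit=-5.0))
```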
Authors

Nicolas Carion
Meta
artificial intelligence, reinforcement learning, deep learning, computer vision, self-supervised learning

Laura Gustafson
Facebook AI Research
Artificial Intelligence

Yuan-Ting Hu
Research Scientist, FAIR, Meta AI
computer vision, machine learning

Shoubhik Debnath
FAIR, AI at Meta
Computer Vision, Deep Learning, Robotics, Reinforcement Learning

Ronghang Hu
Research Scientist, AI at Meta
Computer Vision, Natural Language Processing, Machine Learning

Didac Suris
Meta Superintelligence Labs

Chaitanya Ryali
Meta Superintelligence Labs

Kalyan Vasudev Alwala
Research Engineer, Meta
Computer Vision, Machine Learning, Robotics

Haitham Khedr
Meta Superintelligence Labs

Andrew Huang
MIT

Jie Lei
Universitat Politècnica de València
Computer Engineering, Electronic Engineering

Tengyu Ma
Meta Superintelligence Labs

Baishan Guo
Meta AI

Arpit Kalla
Meta Superintelligence Labs

Markus Marks
Research Scientist at FAIR (Meta AI)
Computer Vision, Machine Learning, AI4Science, Neuroscience

Joseph Greer
Meta Superintelligence Labs

Meng Wang
Meta Superintelligence Labs

Peize Sun
Meta FAIR; HKU
Computer Vision, Deep Learning

Roman Rädle
Meta
human-computer interaction, artificial intelligence, computer vision

Triantafyllos Afouras
FAIR, Meta, University of Oxford
Computer Vision, Machine Learning, Artificial Intelligence

Effrosyni Mavroudi
Research Scientist, FAIR, Meta AI
Computer Vision, Machine Learning, Video Understanding

Katherine Xu
Massachusetts Institute of Technology
Computer Vision, Machine Learning, Artificial Intelligence

Tsung-Han Wu
PhD Student, UC Berkeley
Vision and Language, Computer Vision, Active Learning

Yu Zhou
Meta Superintelligence Labs

Liliane Momeni
University of Oxford
Computer Vision, Machine Learning, Artificial Intelligence