YOLOE-26: Integrating YOLO26 with YOLOE for Real-Time Open-Vocabulary Instance Segmentation

📅 2026-01-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work proposes a real-time, open-vocabulary, end-to-end instance segmentation framework that overcomes the closed-set category restriction of conventional models. By integrating the efficient architecture of YOLO26 with the open-vocabulary learning paradigm of YOLOE, the method eliminates non-maximum suppression (NMS) and replaces fixed classification logits with an object embedding head. It further incorporates three key components: Re-Parameterizable Region-Text Alignment (RepRTA), Semantic-Activated Visual Prompt Encoder (SAVPE), and Lazy Region Prompt Contrast, which together provide unified support for text prompts, visual exemplar guidance, and prompt-free inference. Built upon a convolutional backbone with PAN/FPN multi-scale feature fusion, the model offers favorable accuracy-efficiency trade-offs across model sizes, is compatible with the Ultralytics ecosystem, and is well-suited for real-time instance segmentation in dynamic real-world scenarios.

📝 Abstract

This paper presents YOLOE-26, a unified framework that integrates the deployment-optimized YOLO26 (or YOLOv26) architecture with the open-vocabulary learning paradigm of YOLOE for real-time open-vocabulary instance segmentation. Building on the NMS-free, end-to-end design of YOLO26, the proposed approach preserves the hallmark efficiency and determinism of the YOLO family while extending its capabilities beyond closed-set recognition. YOLOE-26 employs a convolutional backbone with PAN/FPN-style multi-scale feature aggregation, followed by end-to-end regression and instance segmentation heads. A key architectural contribution is the replacement of fixed class logits with an object embedding head, which formulates classification as similarity matching against prompt embeddings derived from text descriptions, visual examples, or a built-in vocabulary. To enable efficient open-vocabulary reasoning, the framework incorporates Re-Parameterizable Region-Text Alignment (RepRTA) for zero-overhead text prompting, a Semantic-Activated Visual Prompt Encoder (SAVPE) for example-guided segmentation, and Lazy Region Prompt Contrast for prompt-free inference. All prompting modalities operate within a unified object embedding space, allowing seamless switching between text-prompted, visual-prompted, and fully autonomous segmentation. Extensive experiments demonstrate consistent scaling behavior and favorable accuracy-efficiency trade-offs across model sizes in both prompted and prompt-free settings. The training strategy leverages large-scale detection and grounding datasets with multi-task optimization and remains fully compatible with the Ultralytics ecosystem for training, validation, and deployment. Overall, YOLOE-26 provides a practical and scalable solution for real-time open-vocabulary instance segmentation in dynamic, real-world environments.
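The idea of classification as similarity matching against prompt embeddings can be illustrated with a small NumPy sketch. This is not the paper's implementation: the function names, the use of cosine similarity, and the `threshold` parameter are all assumptions chosen for illustration, and the abstract does not specify the exact similarity function or scoring rule.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so a dot product equals cosine similarity."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def classify_by_similarity(object_embeddings, prompt_embeddings, prompt_names,
                           threshold=0.5):
    """Assign each object the best-matching prompt, or None below the threshold.

    object_embeddings: (N, D) array, one embedding per predicted object.
    prompt_embeddings: (K, D) array, one embedding per prompt (text, visual
        example, or built-in vocabulary entry; the source is irrelevant here
        because all prompts are assumed to share one embedding space).
    threshold: hypothetical cutoff on cosine similarity (an assumption).
    """
    obj = l2_normalize(np.asarray(object_embeddings, dtype=float))
    prm = l2_normalize(np.asarray(prompt_embeddings, dtype=float))
    similarity = obj @ prm.T                      # (N, K) cosine similarities
    best = similarity.argmax(axis=1)              # best prompt index per object
    scores = similarity[np.arange(len(best)), best]
    return [
        (prompt_names[i] if s >= threshold else None, float(s))
        for i, s in zip(best, scores)
    ]


if __name__ == "__main__":
    prompts = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
    names = ["person", "dog"]
    objects = np.array([[2.0, 0.1, 0.0], [0.0, 3.0, 0.0], [0.0, 0.0, 1.0]])
    for label, score in classify_by_similarity(objects, prompts, names):
        print(label, round(score, 3))
```

Because the class set is defined only by the rows of `prompt_embeddings`, changing the vocabulary in this sketch means swapping the prompt matrix rather than retraining a fixed classifier layer, which is the property the object embedding head is described as providing.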

Problem

Research questions and friction points this paper is trying to address.

open-vocabulary instance segmentation

real-time segmentation

zero-shot object recognition

prompt-based segmentation

YOLO

Innovation

Methods, ideas, or system contributions that make the work stand out.

open-vocabulary instance segmentation

object embedding head

Re-Parameterizable Region-Text Alignment