🤖 AI Summary
Fine-grained visual perception tasks such as detection and segmentation traditionally rely on task-specific architectures, which limits generalization and modularity. To address this, the authors propose UFO, a unified modeling framework for fine-grained vision built on an open-ended language interface: it maps all perception targets (bounding boxes, masks, and referring phrases) into a shared language space, supporting pixel-level segmentation through text prompting and embedding retrieval alone, without task-specific heads. UFO is trained jointly across multiple perception and vision-language tasks and integrates plug-and-play with mainstream multimodal large language models (MLLMs). After multi-task training on five standard visual perception datasets, UFO surpasses the previous state-of-the-art generalist models by 12.3 mAP on COCO instance segmentation and 3.3 mIoU on ADE20K semantic segmentation, supporting the effectiveness and scalability of unifying fine-grained perception within a language-centric representation space.
📝 Abstract
Generalist models have achieved remarkable success in both language and vision-language tasks, showcasing the potential of unified modeling. However, effectively integrating fine-grained perception tasks like detection and segmentation into these models remains a significant challenge, primarily because these tasks often rely heavily on task-specific designs and architectures that complicate the modeling process. To address this challenge, we present UFO, a framework that **U**nifies **F**ine-grained visual perception tasks through an **O**pen-ended language interface. By transforming all perception targets into the language space, UFO unifies object-level detection, pixel-level segmentation, and image-level vision-language tasks into a single model. Additionally, we introduce a novel embedding retrieval approach that relies solely on the language interface to support segmentation tasks. Our framework bridges the gap between fine-grained perception and vision-language tasks, significantly simplifying architectural design and training strategies while achieving comparable or superior performance to methods with intricate task-specific designs. After multi-task training on five standard visual perception datasets, UFO outperforms the previous state-of-the-art generalist models by 12.3 mAP on COCO instance segmentation and 3.3 mIoU on ADE20K semantic segmentation. Furthermore, our method integrates seamlessly with existing MLLMs, combining fine-grained perception capabilities with their advanced language abilities and thereby enabling more challenging tasks such as reasoning segmentation. Code and models are available at https://github.com/nnnth/UFO.
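The embedding-retrieval idea for segmentation can be illustrated with a toy sketch: a query embedding produced via the language interface is compared against per-pixel embeddings, and the mask is the set of pixels whose similarity exceeds a threshold. This is an illustrative NumPy sketch under assumptions, not the paper's actual implementation; the function name `retrieve_mask`, the cosine-similarity scoring, and the fixed threshold are all hypothetical choices.

```python
import numpy as np

def retrieve_mask(query_emb, pixel_embs, threshold=0.5):
    """Toy mask retrieval: cosine similarity between a query embedding
    and per-pixel embeddings, thresholded into a binary mask.

    query_emb:  (D,) embedding from the language interface (hypothetical)
    pixel_embs: (H, W, D) dense image feature map (hypothetical)
    """
    q = query_emb / np.linalg.norm(query_emb)
    p = pixel_embs / np.linalg.norm(pixel_embs, axis=-1, keepdims=True)
    sim = p @ q  # (H, W) cosine similarity map
    return sim > threshold

# Toy usage: a random feature map, with the query taken from one pixel
# so that at least that pixel is guaranteed to be retrieved.
H, W, D = 4, 4, 8
rng = np.random.default_rng(0)
pixel_embs = rng.normal(size=(H, W, D))
query = pixel_embs[1, 2]
mask = retrieve_mask(query, pixel_embs)
assert mask[1, 2]  # the matching pixel is in the mask
```

The point of the sketch is only that segmentation can be posed as retrieval in a shared embedding space, so no mask-specific decoder head is required; the real system's scoring function and decoding details are described in the paper.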