DICEPTION: A Generalist Diffusion Model for Visual Perceptual Tasks

📅 2025-02-24
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
General-purpose visual perception under constrained computational resources and scarce labeled data remains challenging for multi-task learning. Method: We propose the first diffusion-based unified perception framework, reformulating segmentation, detection, and other tasks as conditional image generation. We introduce a novel stochastic color coding scheme to represent diverse task outputs and leverage pre-trained text-to-image diffusion models (e.g., Stable Diffusion), enabling adaptation to new tasks with only 50 annotated images and 1% parameter updates. Results: Our model matches SAM-vit-h’s performance across multiple tasks while requiring merely 0.06% pixel-level annotations (600K vs. 1B), reducing training cost by several orders of magnitude. It significantly advances generalizable perception in low-shot, low-overhead regimes.

Technology Category

Application Category

📝 Abstract
Our primary goal here is to create a good, generalist perception model that can tackle multiple tasks, within limits on computational resources and training data. To achieve this, we resort to text-to-image diffusion models pre-trained on billions of images. Our exhaustive evaluation metrics demonstrate that DICEPTION effectively tackles multiple perception tasks, achieving performance on par with state-of-the-art models. We achieve results on par with SAM-vit-h using only 0.06% of their data (e.g., 600K vs. 1B pixel-level annotated images). Inspired by Wang et al., DICEPTION formulates the outputs of various perception tasks using color encoding; and we show that the strategy of assigning random colors to different instances is highly effective in both entity segmentation and semantic segmentation. Unifying various perception tasks as conditional image generation enables us to fully leverage pre-trained text-to-image models. Thus, DICEPTION can be efficiently trained at a cost of orders of magnitude lower, compared to conventional models that were trained from scratch. When adapting our model to other tasks, it only requires fine-tuning on as few as 50 images and 1% of its parameters. DICEPTION provides valuable insights and a more promising solution for visual generalist models. Homepage: https://aim-uofa.github.io/Diception, Huggingface Demo: https://huggingface.co/spaces/Canyu/Diception-Demo.
Problem

Research questions and friction points this paper is trying to address.

Develop generalist visual perception model
Leverage pre-trained text-to-image diffusion models
Achieve state-of-the-art performance with minimal data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Utilizes pre-trained text-to-image diffusion
Employs color encoding for task outputs
Requires minimal fine-tuning for adaptation
🔎 Similar Papers
No similar papers found.