UniMC: Taming Diffusion Transformer for Unified Keypoint-Guided Multi-Class Image Generation

📅 2025-07-03

📈 Citations: 0

✨ Influential: 0

career value

191K/year

🤖 AI Summary

Existing keypoint-guided text-to-image diffusion models struggle with precise generation of non-rigid objects—especially animals—under complex scenarios involving multiple overlapping instances and severe occlusion, primarily due to insufficient conditional modeling capacity and the lack of large-scale, high-quality multi-class keypoint datasets. To address this, we propose UniMC, a unified diffusion Transformer architecture that encodes class labels, bounding boxes, and keypoints into compact, instance-aware conditional tokens, enabling fine-grained, joint control over semantic and geometric attributes. Furthermore, we introduce HAIG-2.9M—the first large-scale, multi-class human-animal keypoint image dataset comprising 2.9 million samples. Extensive experiments demonstrate that UniMC significantly outperforms state-of-the-art methods on challenging tasks involving heavy occlusion and mixed-category generation, validating both the architectural innovation and the dataset’s effectiveness and generalizability for controllable synthesis of non-rigid objects.

Technology Category

Application Category

📝 Abstract

Although significant advancements have been achieved in the progress of keypoint-guided Text-to-Image diffusion models, existing mainstream keypoint-guided models encounter challenges in controlling the generation of more general non-rigid objects beyond humans (e.g., animals). Moreover, it is difficult to generate multiple overlapping humans and animals based on keypoint controls solely. These challenges arise from two main aspects: the inherent limitations of existing controllable methods and the lack of suitable datasets. First, we design a DiT-based framework, named UniMC, to explore unifying controllable multi-class image generation. UniMC integrates instance- and keypoint-level conditions into compact tokens, incorporating attributes such as class, bounding box, and keypoint coordinates. This approach overcomes the limitations of previous methods that struggled to distinguish instances and classes due to their reliance on skeleton images as conditions. Second, we propose HAIG-2.9M, a large-scale, high-quality, and diverse dataset designed for keypoint-guided human and animal image generation. HAIG-2.9M includes 786K images with 2.9M instances. This dataset features extensive annotations such as keypoints, bounding boxes, and fine-grained captions for both humans and animals, along with rigorous manual inspection to ensure annotation accuracy. Extensive experiments demonstrate the high quality of HAIG-2.9M and the effectiveness of UniMC, particularly in heavy occlusions and multi-class scenarios.

Problem

Research questions and friction points this paper is trying to address.

Control non-rigid object generation beyond humans

Generate multiple overlapping humans and animals

Overcome dataset and method limitations

Innovation

Methods, ideas, or system contributions that make the work stand out.

DiT-based framework for unified multi-class generation

Compact tokens integrate instance and keypoint conditions

Large-scale HAIG-2.9M dataset with detailed annotations

🔎 Similar Papers

No similar papers found.