Hunyuan3D-Omni: A Unified Framework for Controllable Generation of 3D Assets

📅 2025-09-25
🤖 AI Summary
Existing 3D-native generative models predominantly rely on image or text conditioning, lacking fine-grained cross-modal control over geometry, topology, and pose—limiting controllability in industrial applications. To address this, we propose a unified cross-modal 3D generation framework supporting diverse conditioning inputs, including images, point clouds, voxels, bounding boxes, and skeletal poses, enabling joint, fine-grained control of geometric structure and semantic pose. Built upon Hunyuan3D 2.1, our method introduces a cross-modal fusion network and a difficulty-aware progressive sampling strategy to enhance robustness under complex inputs and improve multimodal coordination. Experiments demonstrate that multi-condition joint control significantly improves generation accuracy and shape fidelity, while enabling geometry-aware controllable deformation. The framework exhibits superior stability and practicality in production pipelines for gaming, film, and visual effects.

📝 Abstract
Recent advances in 3D-native generative models have accelerated asset creation for games, film, and design. However, most methods still rely primarily on image or text conditioning and lack fine-grained, cross-modal controls, which limits controllability and practical adoption. To address this gap, we present Hunyuan3D-Omni, a unified framework for fine-grained, controllable 3D asset generation built on Hunyuan3D 2.1. In addition to images, Hunyuan3D-Omni accepts point clouds, voxels, bounding boxes, and skeletal pose priors as conditioning signals, enabling precise control over geometry, topology, and pose. Instead of separate heads for each modality, our model unifies all signals in a single cross-modal architecture. We train with a progressive, difficulty-aware sampling strategy that selects one control modality per example and biases sampling toward harder signals (e.g., skeletal pose) while downweighting easier ones (e.g., point clouds), encouraging robust multi-modal fusion and graceful handling of missing inputs. Experiments show that these additional controls improve generation accuracy, enable geometry-aware transformations, and increase robustness for production workflows.
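The difficulty-aware sampling strategy described in the abstract can be sketched as weighted per-example modality selection: each training example is assigned exactly one control modality, with the draw biased toward harder signals. The specific weights below are illustrative assumptions, not values reported in the paper:

```python
import random

# Hedged sketch: one control modality is selected per training example,
# biased toward harder signals (e.g., skeletal pose) and away from easier
# ones (e.g., point clouds). These weights are illustrative assumptions.
MODALITY_WEIGHTS = {
    "skeletal_pose": 0.40,  # hardest signal: sampled most often
    "bounding_box": 0.25,
    "voxel": 0.20,
    "point_cloud": 0.15,    # easiest signal: downweighted
}

def sample_control_modality(rng: random.Random) -> str:
    """Pick exactly one conditioning modality for a training example."""
    modalities = list(MODALITY_WEIGHTS)
    weights = [MODALITY_WEIGHTS[m] for m in modalities]
    return rng.choices(modalities, weights=weights, k=1)[0]

# Simulate a training epoch to see the bias in effect.
rng = random.Random(0)
counts = {m: 0 for m in MODALITY_WEIGHTS}
for _ in range(10_000):
    counts[sample_control_modality(rng)] += 1
```

Over many draws, the harder modalities dominate the schedule, which is the intended effect of downweighting easy signals during training.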
Problem

Research questions and friction points this paper is trying to address.

Addresses limited controllability in 3D asset generation from text or images
Enables fine-grained control over geometry, topology, and pose using multiple inputs
Unifies various conditioning signals in a single cross-modal architecture
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified cross-modal architecture that handles all conditioning signals in a single model
Progressive, difficulty-aware sampling strategy that biases training toward harder control signals
Supports diverse conditioning inputs, including point clouds, voxels, bounding boxes, and skeletal poses
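The single-architecture idea can be sketched as token unification: each modality is encoded into a shared token space and the results are concatenated into one condition sequence for the generator to attend over. The encoders and dimensions here are toy placeholders, not Hunyuan3D-Omni's actual layers:

```python
import numpy as np

# Hedged sketch of "one architecture for all signals": every conditioning
# modality is projected into a shared token width D and concatenated into
# a single sequence. All shapes and projections are illustrative.
D = 64  # shared token width (assumed)

def encode(features: np.ndarray, proj: np.ndarray) -> np.ndarray:
    """Project raw condition features into the shared token space."""
    return features @ proj

rng = np.random.default_rng(0)
point_cloud = rng.normal(size=(128, 3))  # N points, xyz coordinates
skeleton = rng.normal(size=(22, 6))      # joints: position + rotation

proj_pc = rng.normal(size=(3, D))        # per-modality projection layers
proj_sk = rng.normal(size=(6, D))

tokens = np.concatenate(
    [encode(point_cloud, proj_pc), encode(skeleton, proj_sk)], axis=0
)
# One unified condition sequence: (128 + 22) tokens of width D
```

A single sequence like this lets one set of attention layers serve every modality, instead of maintaining a separate conditioning head per input type.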
Team Hunyuan3D
Tencent Hunyuan3D
Bowen Zhang
Tencent Hunyuan3D
Chunchao Guo
Tencent Hunyuan3D
Haolin Liu
Tencent Hunyuan3D
Hongyu Yan
Tencent Hunyuan3D
Huiwen Shi
Tencent Hunyuan3D
Jingwei Huang
Tencent Hunyuan3D
Junlin Yu
Tencent Hunyuan3D
Kunhong Li
Sun Yat-sen University
Linus
Tencent Hunyuan3D
Penghao Wang
Tencent Hunyuan3D
Qingxiang Lin
Tencent Hunyuan3D
Sicong Liu
Tencent Hunyuan3D
Xianghui Yang
Tencent Hunyuan3D
Yixuan Tang
Tencent Hunyuan3D
Yunfei Zhao
Peking University
intelligent program, code generation, code representation
Zeqiang Lai
CUHK | Tencent | BIT
Low Level Vision, Generated Models, Proximal Algorithm
Zhihao Liang
South China University of Technology
Computer Vision and Pattern Recognition, Machine Learning
Zibo Zhao
Hunyuan, Tencent; ShanghaiTech