🤖 AI Summary
Existing 3D-native generative models rely predominantly on image or text conditioning and lack fine-grained cross-modal control over geometry, topology, and pose, which limits controllability in industrial applications. To address this, we propose a unified cross-modal 3D generation framework that supports diverse conditioning inputs, including images, point clouds, voxels, bounding boxes, and skeletal poses, enabling joint, fine-grained control of geometric structure and semantic pose. Built upon Hunyuan3D 2.1, our method introduces a cross-modal fusion network and a difficulty-aware progressive sampling strategy to improve robustness under complex inputs and strengthen multimodal coordination. Experiments demonstrate that joint multi-condition control significantly improves generation accuracy and shape fidelity while enabling geometry-aware, controllable deformation. The framework exhibits strong stability and practicality in production pipelines for gaming, film, and visual effects.
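The cross-modal fusion network is described only at a high level here. The sketch below shows one plausible way such fusion could work: each control modality is tokenized into a shared embedding space and injected into the shape latents via residual cross-attention. All names, feature layouts, and dimensions (`UnifiedConditionEncoder`, `CrossModalFusion`, the per-modality input widths) are illustrative assumptions, not the paper's actual architecture.

```python
# Minimal sketch, NOT the paper's implementation: one projector per control
# modality maps heterogeneous inputs into a shared token space, and a
# cross-attention block injects those tokens into the backbone's latents.
import torch
import torch.nn as nn

class UnifiedConditionEncoder(nn.Module):
    """Projects any one control modality into a shared token space."""
    def __init__(self, dim: int = 512):
        super().__init__()
        # Hypothetical per-modality feature widths (coords, corners, joints...).
        self.proj = nn.ModuleDict({
            "point": nn.Linear(3, dim),  # (B, N, 3) point-cloud coordinates
            "voxel": nn.Linear(4, dim),  # (B, N, 4) voxel centers + scale
            "bbox":  nn.Linear(6, dim),  # (B, K, 6) box min/max corners
            "pose":  nn.Linear(7, dim),  # (B, J, 7) joint position + quaternion
        })
        self.names = list(self.proj.keys())
        # A learned type embedding tells the fusion stage which modality it sees.
        self.type_embed = nn.Embedding(len(self.names), dim)

    def forward(self, modality: str, feats: torch.Tensor) -> torch.Tensor:
        tokens = self.proj[modality](feats)  # (B, N, dim) in the shared space
        idx = torch.tensor(self.names.index(modality), device=feats.device)
        return tokens + self.type_embed(idx)

class CrossModalFusion(nn.Module):
    """Injects control tokens into backbone latents via cross-attention."""
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, latents: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # latents: (B, L, dim) shape tokens; cond: (B, N, dim) control tokens.
        fused, _ = self.attn(self.norm(latents), cond, cond, need_weights=False)
        return latents + fused  # residual: base path survives if cond is weak

# Usage (shapes are invented for illustration):
enc, fusion = UnifiedConditionEncoder(), CrossModalFusion()
latents = torch.randn(2, 1024, 512)           # hypothetical shape-latent tokens
pose = torch.randn(2, 22, 7)                  # 22 joints: position + quaternion
latents = fusion(latents, enc("pose", pose))  # inject the pose control signal
```

A residual injection like this is one way to keep the base image-conditioned path intact when a control signal is absent, which is consistent with the stated goal of handling missing inputs gracefully.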
📝 Abstract
Recent advances in 3D-native generative models have accelerated asset creation for games, film, and design. However, most methods still rely primarily on image or text conditioning and lack fine-grained, cross-modal controls, which limits controllability and practical adoption. To address this gap, we present Hunyuan3D-Omni, a unified framework for fine-grained, controllable 3D asset generation built on Hunyuan3D 2.1. In addition to images, Hunyuan3D-Omni accepts point clouds, voxels, bounding boxes, and skeletal pose priors as conditioning signals, enabling precise control over geometry, topology, and pose. Instead of separate heads for each modality, our model unifies all signals in a single cross-modal architecture. We train with a progressive, difficulty-aware sampling strategy that selects one control modality per example and biases sampling toward harder signals (e.g., skeletal pose) while downweighting easier ones (e.g., point clouds), encouraging robust multi-modal fusion and graceful handling of missing inputs. Experiments show that these additional controls improve generation accuracy, enable geometry-aware transformations, and increase robustness for production workflows.
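To make the sampling strategy concrete, here is a minimal sketch of difficulty-aware progressive sampling under stated assumptions: exactly one control modality is drawn per training example, and the bias toward harder signals is ramped in over a warmup period. The difficulty weights, the warmup length, and the linear schedule are invented for illustration; the paper does not specify these values here.

```python
# Minimal sketch of difficulty-aware, progressive modality sampling.
# All numeric values below are illustrative assumptions, not the paper's.
import random

# Assumed relative difficulty: harder modalities are sampled more often.
DIFFICULTY = {"pose": 4.0, "bbox": 2.0, "voxel": 1.5, "point": 1.0}

def sample_modality(step: int, warmup_steps: int = 10_000) -> str:
    """Pick one control modality for the current training example.

    Early in training, sampling is uniform so every modality is learned;
    the difficulty bias is then blended in linearly, shifting capacity
    toward harder signals (pose) and away from easier ones (point cloud).
    """
    t = min(step / warmup_steps, 1.0)  # 0 -> uniform, 1 -> fully biased
    names = list(DIFFICULTY)
    weights = [(1 - t) * 1.0 + t * DIFFICULTY[m] for m in names]
    return random.choices(names, weights=weights, k=1)[0]

# Usage: inside the training loop, keep only the chosen condition and drop
# the rest, which also trains the model to cope with missing inputs.
cond_name = sample_modality(step=2_500)
```

Drawing a single modality per example (rather than concatenating all of them) forces each control pathway to carry the full conditioning burden on its own, which is one plausible reading of why the strategy encourages robust fusion and tolerance of absent signals.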