🤖 AI Summary
Existing methods typically generate terrain heightmaps and textures in isolation, neglecting their intrinsic geometric-appearance coupling and thereby limiting visual realism. To address this, we propose the first unsupervised framework for joint heightmap and texture generation, explicitly modeling geometric-appearance coherence. Our approach builds on a latent diffusion model (LDM) trained on unannotated heightmap-texture pairs to enforce cross-modal consistency without supervision, and incorporates a lightweight external adapter to enable sketch-guided controllable synthesis. Crucially, the base model requires no annotations, yet the framework achieves semantically coherent, high-fidelity, and diverse terrain synthesis. Extensive qualitative and quantitative evaluations demonstrate significant improvements over unimodal baselines across multiple metrics. The method shows strong practical utility for high-fidelity terrain modeling in games and film production.
📝 Abstract
3D terrain models are essential in fields such as video game development and film production. Since surface color often correlates with terrain geometry, capturing this relationship is crucial to achieving realism. However, most existing methods generate either a heightmap or a texture, without sufficiently accounting for the inherent correlation. In this paper, we propose a method that jointly generates terrain heightmaps and textures using a latent diffusion model. First, we train the model in an unsupervised manner to randomly generate paired heightmaps and textures. Then, we perform supervised learning of an external adapter to enable user control via hand-drawn sketches. Experiments show that our approach allows intuitive terrain generation while preserving the correlation between heightmaps and textures.
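The core idea — denoising a heightmap and a texture as one joint latent so their correlation is preserved, with an external adapter injecting sketch guidance — can be illustrated with a minimal toy sketch. Everything below is assumed for illustration (the channel split, the stand-in denoiser, and the adapter design are not the paper's actual architecture); it only shows how a single diffusion loop can update both modalities together.

```python
import numpy as np

# Toy sketch of joint heightmap + texture generation (illustrative only).
# The two modalities are concatenated along the channel axis so one
# denoising loop updates them together, preserving their correlation.
# An "adapter" adds a sketch-conditioned residual to the denoiser input.

rng = np.random.default_rng(0)

H = W = 8                # toy latent resolution (assumed)
C_HEIGHT, C_TEX = 1, 3   # assumed split: 1 height channel + 3 texture channels
STEPS = 10               # toy number of denoising steps

def toy_denoiser(z, t):
    """Stand-in for the LDM's noise predictor: just shrinks the latent."""
    return z * 0.1

def toy_adapter(sketch):
    """Stand-in for the external adapter: lifts a 1-channel sketch
    to a small residual over all joint channels."""
    return np.repeat(sketch, C_HEIGHT + C_TEX, axis=0) * 0.05

def generate(sketch=None):
    # Joint latent: heightmap and texture share one tensor from the start.
    z = rng.standard_normal((C_HEIGHT + C_TEX, H, W))
    for t in range(STEPS, 0, -1):
        cond = toy_adapter(sketch) if sketch is not None else 0.0
        z = z - toy_denoiser(z + cond, t)
    # In the real pipeline each half would be decoded by the LDM's decoder.
    height = z[:C_HEIGHT]
    texture = z[C_HEIGHT:]
    return height, texture

height, texture = generate(sketch=np.ones((1, H, W)))
print(height.shape, texture.shape)  # (1, 8, 8) (3, 8, 8)
```

The point of the single loop is that every denoising step sees (and updates) both modalities at once, which is what lets the model keep surface color consistent with geometry; a per-modality loop would lose that coupling.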