HY-World 2.0: A Multi-Modal World Model for Reconstructing, Generating, and Simulating 3D Worlds

📅 2026-04-15

🤖 AI Summary
This work addresses the challenge of generating high-fidelity, navigable 3D scenes from multimodal inputs (text, single images, multi-view images, or videos). For text or single-image inputs, it introduces a four-stage generation pipeline: panorama generation, trajectory planning, world expansion, and world composition. The framework integrates HY-Pano 2.0, WorldNav, WorldStereo 2.0, WorldMirror 2.0, and the high-performance rendering platform WorldLens. WorldStereo 2.0 is a keyframe-based view generation model with consistent memory, while WorldMirror 2.0 is a feed-forward model for universal 3D prediction that also enables world reconstruction from multi-view images or videos. Producing 3D Gaussian Splatting (3DGS) scenes, the system achieves state-of-the-art performance among open-source approaches on several benchmarks, with results comparable to the closed-source model Marble. The authors release all model weights, code, and technical details.

📝 Abstract
We introduce HY-World 2.0, a multi-modal world model framework that advances our prior project HY-World 1.0. HY-World 2.0 accommodates diverse input modalities, including text prompts, single-view images, multi-view images, and videos, and produces 3D world representations. With text or single-view image inputs, the model performs world generation, synthesizing high-fidelity, navigable 3D Gaussian Splatting (3DGS) scenes. This is achieved through a four-stage method: a) Panorama Generation with HY-Pano 2.0, b) Trajectory Planning with WorldNav, c) World Expansion with WorldStereo 2.0, and d) World Composition with WorldMirror 2.0. Specifically, we introduce key innovations to enhance panorama fidelity, enable 3D scene understanding and planning, and upgrade WorldStereo, our keyframe-based view generation model with consistent memory. We also upgrade WorldMirror, a feed-forward model for universal 3D prediction, by refining model architecture and learning strategy, enabling world reconstruction from multi-view images or videos. In addition, we introduce WorldLens, a high-performance 3DGS rendering platform featuring a flexible engine-agnostic architecture, automatic IBL lighting, efficient collision detection, and training-rendering co-design, enabling interactive exploration of 3D worlds with character support. Extensive experiments demonstrate that HY-World 2.0 achieves state-of-the-art performance on several benchmarks among open-source approaches, delivering results comparable to the closed-source model Marble. We release all model weights, code, and technical details to facilitate reproducibility and support further research on 3D world models.
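The routing described in the abstract can be illustrated with a small sketch: text and single-view image inputs go through the four generation stages, while multi-view images and videos go to reconstruction with WorldMirror 2.0. The stage and component names are taken from the abstract. Everything else (`InputModality`, `plan_stages`, and the "World Reconstruction" label) is a hypothetical illustration and not the released HY-World 2.0 API.

```python
from enum import Enum


class InputModality(Enum):
    """Input types listed in the abstract (enum itself is hypothetical)."""
    TEXT = "text prompt"
    SINGLE_IMAGE = "single-view image"
    MULTI_VIEW = "multi-view images"
    VIDEO = "video"


# (stage, component) pairs, in the order given in the abstract.
GENERATION_STAGES = [
    ("Panorama Generation", "HY-Pano 2.0"),
    ("Trajectory Planning", "WorldNav"),
    ("World Expansion", "WorldStereo 2.0"),
    ("World Composition", "WorldMirror 2.0"),
]

# The abstract says WorldMirror enables reconstruction from multi-view
# images or videos; the stage label here is an assumption for illustration.
RECONSTRUCTION_STAGES = [
    ("World Reconstruction", "WorldMirror 2.0"),
]


def plan_stages(modality: InputModality) -> list[tuple[str, str]]:
    """Return the ordered (stage, component) list for an input modality."""
    if modality in (InputModality.TEXT, InputModality.SINGLE_IMAGE):
        return list(GENERATION_STAGES)
    return list(RECONSTRUCTION_STAGES)


if __name__ == "__main__":
    for modality in InputModality:
        print(modality.value)
        for stage, component in plan_stages(modality):
            print(f"  {stage}: {component}")
```

This only encodes the control flow stated in the abstract. It says nothing about how each component is invoked or what data passes between stages.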
Problem

Research questions and friction points this paper is trying to address.

3D world generation
multi-modal input
3D reconstruction
world simulation
3D Gaussian Splatting
Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-modal world model
3D Gaussian Splatting
panorama generation
consistent view synthesis
interactive 3D rendering
Team HY-World
Tencent Hunyuan

Chenjie Cao
Alibaba DAMO Academy
image inpainting, multi-view stereo, novel view synthesis

Xuhui Zuo
Tencent Hunyuan

Zhenwei Wang
Tencent Hunyuan

Yisu Zhang
Tencent Hunyuan

Junta Wu
Tencent Hunyuan

Zhenyang Liu
Tencent Hunyuan

Yuning Gong
Tencent Hunyuan

Yang Liu
Microsoft
natural language processing, text summarization, text generation

Bo Yuan
PhD Student in Machine Learning, Georgia Institute of Technology
Markov chain Monte Carlo, Large Language Model

Chao Zhang
Alibaba

Coopers Li
Tencent Hunyuan

Dongyuan Guo
Tencent Hunyuan

Fan Yang
Tencent AI Lab
AI, Precision Medicine, Single Cell Multiomics, Bioinformatics, e-mail: fan.yang.zone@gmail.com

Haiyu Zhang
Beihang University
Neural Fields

Hang Cao
Tencent Hunyuan

Jianchen Zhu
Tencent Hunyuan

Jiaxin Lin
The University of Texas at Austin
Computer Science

Jie Xiao
University of Science and Technology of China
low level vision, generative model, machine learning

Jihong Zhang
Tencent Hunyuan

Junlin Yu
Tencent Hunyuan

Lei Wang
Tencent Hunyuan

Lifu Wang
Tencent Hunyuan

Lilin Wang
Tencent Hunyuan

Linus
Tencent Hunyuan