Follow-Your-Instruction: A Comprehensive MLLM Agent for World Data Synthesis

📅 2025-08-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high cost and long latency of acquiring high-quality multimodal data for AIGC, this paper proposes the first end-to-end world data synthesis framework powered by multimodal large language models (MLLMs), unifying automated generation of 2D images, 3D scenes, and 4D dynamic sequences. Methodologically, it employs instruction-driven pipelines: MLLM-Collector for asset acquisition, MLLM-Generator and MLLM-Optimizer for semantically consistent 3D layout generation and multi-view optimization, and MLLM-Planner for temporally coherent future-frame prediction. Its core contribution is the first deep integration of MLLMs across the full-stack data synthesis pipeline, preserving geometric and semantic fidelity while substantially improving diversity and scalability. Experiments demonstrate that the synthesized data consistently boosts performance on downstream 2D, 3D, and 4D vision tasks, validating its efficacy and generalizability as a universal data engine.
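The four-stage pipeline named in the summary can be sketched as plain Python. This is a hypothetical illustration only: every function body is a stub, and the names, data shapes, and return types are assumptions, not the paper's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Asset:
    """A collected asset and its text description (illustrative)."""
    name: str
    description: str

def mllm_collector(instruction: str) -> list[Asset]:
    """MLLM-Collector: gather assets from a multimodal instruction (stubbed)."""
    return [Asset(name=w, description=f"asset for '{w}'") for w in instruction.split()]

def mllm_generator(assets: list[Asset]) -> dict:
    """MLLM-Generator: arrange assets into a 3D layout (stubbed as name -> xyz)."""
    return {a.name: (i * 1.0, 0.0, 0.0) for i, a in enumerate(assets)}

def mllm_optimizer(layout: dict) -> dict:
    """MLLM-Optimizer: multi-view semantic refinement (stubbed as a no-op)."""
    return dict(layout)

def mllm_planner(layout: dict, n_frames: int = 3) -> list[dict]:
    """MLLM-Planner: temporally coherent future frames (stubbed as copies)."""
    return [dict(layout) for _ in range(n_frames)]

def synthesize(instruction: str) -> list[dict]:
    """Chain the four stages end to end."""
    assets = mllm_collector(instruction)
    layout = mllm_generator(assets)
    layout = mllm_optimizer(layout)
    return mllm_planner(layout)

frames = synthesize("sofa lamp table")
print(len(frames), len(frames[0]))  # prints "3 3": 3 frames, 3 objects each
```

The point of the sketch is only the data flow: instruction in, assets, then a layout, then refined layout, then a frame sequence out.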

📝 Abstract
With the growing demands of AI-generated content (AIGC), the need for high-quality, diverse, and scalable data has become increasingly crucial. However, collecting large-scale real-world data remains costly and time-consuming, hindering the development of downstream applications. While some works attempt to collect task-specific data via a rendering process, most approaches still rely on manual scene construction, limiting their scalability and accuracy. To address these challenges, we propose Follow-Your-Instruction, a Multimodal Large Language Model (MLLM)-driven framework for automatically synthesizing high-quality 2D, 3D, and 4D data. Our Follow-Your-Instruction first collects assets and their associated descriptions through multimodal inputs using the MLLM-Collector. Then it constructs 3D layouts and leverages Vision-Language Models (VLMs) for semantic refinement across multi-view scenes with the MLLM-Generator and MLLM-Optimizer, respectively. Finally, it uses the MLLM-Planner to generate temporally coherent future frames. We evaluate the quality of the generated data through comprehensive experiments on 2D, 3D, and 4D generative tasks. The results show that our synthetic data significantly boosts the performance of existing baseline models, demonstrating Follow-Your-Instruction's potential as a scalable and effective data engine for generative intelligence.
Problem

Research questions and friction points this paper is trying to address.

High-quality, diverse, and scalable data synthesis for AIGC
Automating manual scene construction in data generation
Enhancing generative models with multimodal synthetic data
Innovation

Methods, ideas, or system contributions that make the work stand out.

MLLM-driven framework for data synthesis
Multimodal inputs for asset collection
VLMs for semantic refinement and planning
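The multi-view semantic refinement idea behind MLLM-Optimizer can be illustrated with a toy loop: render-and-check feedback is replaced by a stand-in overlap test on a hypothetical 1D layout, and objects are nudged until no check fails. Both the geometry and the update rule are assumptions for illustration, not the paper's method.

```python
def overlapping_pairs(layout: dict[str, float], min_gap: float = 1.0):
    """Find adjacent object pairs closer than min_gap on a 1D shelf (toy check,
    standing in for VLM feedback about semantically implausible placements)."""
    names = sorted(layout, key=layout.get)
    return [(a, b) for a, b in zip(names, names[1:])
            if layout[b] - layout[a] < min_gap]

def refine(layout: dict[str, float], min_gap: float = 1.0, max_steps: int = 10):
    """Iteratively repair the layout until the check passes or steps run out."""
    layout = dict(layout)
    for _ in range(max_steps):
        pairs = overlapping_pairs(layout, min_gap)
        if not pairs:
            break
        for _, later in pairs:
            layout[later] += min_gap  # push the later object along the shelf
    return layout

fixed = refine({"sofa": 0.0, "lamp": 0.2, "table": 0.3})
assert not overlapping_pairs(fixed)  # all objects now at least min_gap apart
```

The structure mirrors the paper's described loop at a high level: generate, criticize with a vision-language signal, adjust, repeat.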
👥 Authors
Kunyu Feng (HKUST(GZ))
Yue Ma (Bytedance)
Xinhua Zhang (University of Illinois Chicago)
Boshi Liu (Peking University)
Yikuang Yuluo (Chongqing University)
Yinhan Zhang (HKUST(GZ))
Runtao Liu (Hong Kong University of Science and Technology)
Hongyu Liu (HKUST)
Zhiyuan Qin (Beijing Innovation Center of Humanoid Robotics)
Shanhui Mo (HKUST)
Qifeng Chen (HKUST)
Zeyu Wang (HKUST(GZ), HKUST)