🤖 AI Summary
Existing synthetic datasets lack the photorealism and temporal coherence needed to apply generative inverse and forward rendering to real-world scenarios. This work introduces a large-scale dynamic video dataset captured from AAA games, using a dual-screen stitched capture method to record synchronized RGB frames and five G-buffer channels. The dataset supports both rendering directions: decomposing in-the-wild video into geometry and material properties, and generating video guided by G-buffers. The work also proposes a ground-truth-free evaluation protocol that uses vision-language models (VLMs) to assess semantic, spatial, and temporal consistency. Experiments show that inverse renderers fine-tuned on the dataset achieve superior cross-dataset generalization and controllable generation, and that the VLM-based evaluation correlates strongly with human judgment. Combined with the accompanying toolkit, the forward renderer lets users edit the style of AAA games from G-buffers using text prompts.
📝 Abstract
Scaling generative inverse and forward rendering to real-world scenarios is bottlenecked by the limited realism and temporal coherence of existing synthetic datasets. To bridge this persistent domain gap, we introduce a large-scale, dynamic dataset curated from visually complex AAA games. Using a novel dual-screen stitched capture method, we extract 4M continuous frames (720p/30 FPS) of synchronized RGB and five G-buffer channels across diverse scenes, visual effects, and environments, including adverse weather and motion-blur variants. The dataset advances rendering in both directions: it enables robust in-the-wild geometry and material decomposition (inverse rendering) and supports high-fidelity G-buffer-guided video generation (forward rendering). Furthermore, to evaluate the real-world performance of inverse rendering without ground truth, we propose a novel VLM-based assessment protocol that measures semantic, spatial, and temporal consistency. Experiments demonstrate that inverse renderers fine-tuned on our data achieve superior cross-dataset generalization and controllable generation, and that our VLM evaluation correlates strongly with human judgment. Combined with our toolkit, our forward renderer enables users to edit the styles of AAA games from G-buffers using text prompts.
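To give a sense of scale, the frame count and frame rate quoted in the abstract can be converted into hours of footage. This is a back-of-the-envelope sketch, not a figure reported by the authors: it assumes "4M" means exactly 4,000,000 time steps and that every frame was captured at 30 FPS. The function name `footage_hours` is hypothetical.

```python
def footage_hours(num_frames: int, fps: float) -> float:
    """Return the playback duration, in hours, of num_frames captured at fps."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    seconds = num_frames / fps
    return seconds / 3600.0


# Assumption: "4M" is taken as exactly 4,000,000 frames, all at 30 FPS.
NUM_FRAMES = 4_000_000
FPS = 30

hours = footage_hours(NUM_FRAMES, FPS)
print(f"{hours:.1f} hours of footage")  # prints "37.0 hours of footage"
```

Under these assumptions the dataset corresponds to roughly 37 hours of continuous video per synchronized stream.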