FrameDiffuser: G-Buffer-Conditioned Diffusion for Neural Forward Frame Rendering

📅 2025-12-18
🤖 AI Summary
To address the temporal incoherence of single-frame models and the high computational cost and poor interactivity of video-based neural rendering methods, this paper proposes the first autoregressive G-buffer-conditioned diffusion framework. It integrates ControlNet for geometry- and material-aware structural guidance and uses ControlLoRA to model inter-frame dependencies explicitly, ensuring temporal consistency. A three-stage training strategy enables stable generation over hundreds to thousands of frames. Environment-specific training, combined with dual conditioning on G-buffers and the previously generated frame, captures realistic lighting, shadows, and reflections. Evaluated on a single GPU, the method achieves interactive frame rates (>20 FPS) and significantly outperforms baselines, including RGBX and DiffusionRenderer, in PSNR and SSIM. Temporal stability improves by 37.2% (ΔLPIPS ↓), demonstrating superior motion coherence and fidelity.

📝 Abstract
Neural rendering for interactive applications requires translating geometric and material properties (G-buffer) to photorealistic images with realistic lighting on a frame-by-frame basis. While recent diffusion-based approaches show promise for G-buffer-conditioned image synthesis, they face critical limitations: single-image models like RGBX generate frames independently without temporal consistency, while video models like DiffusionRenderer are too computationally expensive for most consumer gaming setups and require complete sequences upfront, making them unsuitable for interactive applications where future frames depend on user input. We introduce FrameDiffuser, an autoregressive neural rendering framework that generates temporally consistent, photorealistic frames by conditioning on G-buffer data and the model's own previous output. After an initial frame, FrameDiffuser operates purely on incoming G-buffer data, comprising geometry, materials, and surface properties, while using its previously generated frame for temporal guidance, maintaining stable, temporally consistent generation over hundreds to thousands of frames. Our dual-conditioning architecture combines ControlNet for structural guidance with ControlLoRA for temporal coherence. A three-stage training strategy enables stable autoregressive generation. We specialize our model to individual environments, prioritizing consistency and inference speed over broad generalization, demonstrating that environment-specific training achieves superior photorealistic quality with accurate lighting, shadows, and reflections compared to generalized approaches.
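The autoregressive loop described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: `rollout`, `toy_step`, and the list-of-floats `Frame` and `GBuffer` types are hypothetical stand-ins, and `toy_step` simply blends its inputs where the real system would run the diffusion model.

```python
from typing import Callable, Iterable, Iterator, List

Frame = List[float]    # hypothetical stand-in for an RGB image
GBuffer = List[float]  # hypothetical stand-in for stacked geometry/material channels


def rollout(
    render_step: Callable[[GBuffer, Frame], Frame],
    first_frame: Frame,
    gbuffers: Iterable[GBuffer],
) -> Iterator[Frame]:
    """Yield one frame per incoming G-buffer, feeding each output back in."""
    previous = first_frame
    for gbuffer in gbuffers:
        # Only the current G-buffer and the model's own previous output are
        # needed, so no future frames have to be known in advance.
        previous = render_step(gbuffer, previous)
        yield previous


def toy_step(gbuffer: GBuffer, previous: Frame) -> Frame:
    """Assumed placeholder for the diffusion model: averages its two inputs."""
    return [0.5 * g + 0.5 * p for g, p in zip(gbuffer, previous)]


# Three incoming G-buffers, starting from an all-zero initial frame.
frames = list(rollout(toy_step, [0.0, 0.0], [[1.0, 1.0]] * 3))
```

Because `rollout` is a generator, each frame is produced as soon as its G-buffer arrives, which is the property that separates this setup from video models that need the complete sequence upfront.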

Problem

Research questions and friction points this paper is trying to address.

Generates temporally consistent photorealistic frames from G-buffer data
Enables real-time neural rendering for interactive applications like gaming
Overcomes computational expense and sequence dependency of prior methods

Innovation

Methods, ideas, or system contributions that make the work stand out.

Autoregressive neural rendering framework for temporal consistency
Dual-conditioning architecture with ControlNet and ControlLoRA
Environment-specific training for photorealistic quality and speed
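For readers unfamiliar with LoRA-style adapters, the sketch below shows the generic low-rank weight update, W + (alpha / r) · B · A, that such adapters apply to a frozen weight matrix. It assumes ControlLoRA builds on this standard formulation; the summary above does not spell out its internals, so the helper names `matmul` and `lora_weight` and the toy matrices are illustrative only.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product of a (m x k) and b (k x n)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_weight(w: Matrix, b: Matrix, a: Matrix, alpha: float) -> Matrix:
    """Return W + (alpha / r) * B @ A, where r is the adapter rank."""
    rank = len(a)  # A has shape (r, in_features)
    scale = alpha / rank
    delta = matmul(b, a)
    return [
        [wv + scale * dv for wv, dv in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


w = [[1.0, 0.0], [0.0, 1.0]]  # frozen base weight (2 x 2)
b = [[1.0], [2.0]]            # trainable factor, shape (out_features, r), r = 1
a = [[0.5, 0.5]]              # trainable factor, shape (r, in_features)
merged = lora_weight(w, b, a, alpha=1.0)
```

Only `b` and `a` are trained in this scheme, so the adapter adds far fewer parameters than retraining `w`.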