AI Summary
This work addresses the challenges of jointly modeling RGB and depth images and mitigating inconsistent cross-modal noise. We propose JointDiT, the first diffusion Transformer explicitly designed to model the joint RGB-depth distribution. Our method introduces two simple yet effective techniques: (1) adaptive scheduling weights that depend on each modality's noise level, and (2) an unbalanced timestep sampling strategy that trains the model across all noise-level combinations. Because each branch's timestep can then be controlled independently, JointDiT uniformly supports three tasks: joint RGB-depth generation, monocular depth estimation, and depth-guided image generation. In joint generation, it achieves state-of-the-art visual quality and geometric accuracy; in depth estimation and depth-guided generation, it matches the performance of dedicated single-task models. These results empirically validate the effectiveness and generality of joint distribution modeling as a principled alternative to conventional conditional generation paradigms.
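The per-branch timestep control described above can be sketched as a small dispatch function. This is an illustrative reading of the paper's inference-time behavior, not its actual code: the function name, task labels, and the convention that a branch starting at the maximum timestep is denoised from pure noise while a branch at timestep 0 serves as a clean condition are all assumptions for exposition.

```python
def branch_timesteps(task: str, t_max: float = 1.0) -> tuple[float, float]:
    """Hypothetical sketch: pick starting timesteps (t_rgb, t_depth) per task.

    A branch started at t_max is denoised from pure noise; a branch held
    at 0 is kept clean and acts as the condition for the other branch.
    """
    if task == "joint_generation":
        return t_max, t_max      # denoise RGB and depth together from noise
    if task == "depth_estimation":
        return 0.0, t_max        # clean RGB conditions the depth branch
    if task == "depth_to_image":
        return t_max, 0.0        # clean depth conditions the RGB branch
    raise ValueError(f"unknown task: {task}")
```

Under this reading, a single trained model switches between the three tasks purely by how the two branch timesteps are initialized, with no task-specific heads or fine-tuning.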
Abstract
We present JointDiT, a diffusion transformer that models the joint distribution of RGB and depth. By leveraging the architectural benefits and strong image prior of a state-of-the-art diffusion transformer, JointDiT not only generates high-fidelity images but also produces geometrically plausible and accurate depth maps. This robust joint distribution modeling is achieved through two simple yet effective techniques that we propose: adaptive scheduling weights, which depend on the noise level of each modality, and an unbalanced timestep sampling strategy. With these techniques, we train our model across all noise levels for each modality, enabling JointDiT to naturally handle various combinatorial generation tasks, including joint generation, depth estimation, and depth-conditioned image generation, by simply controlling the timestep of each branch. JointDiT demonstrates outstanding joint generation performance. Furthermore, it achieves comparable results in depth estimation and depth-conditioned image generation, suggesting that joint distribution modeling can serve as a viable alternative to conditional generation. The project page is available at https://byungki-k.github.io/JointDiT/.
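The unbalanced timestep sampling strategy mentioned above can be illustrated with a minimal sketch. The abstract only states that the model is trained across all noise levels for each modality; the sampling scheme below, including the function name, the probability `p_clean`, and the choice to occasionally pin one branch to a fully clean state, is a hypothetical construction to show one way such coverage could be achieved, not the paper's actual recipe.

```python
import random

def sample_timesteps_unbalanced(p_clean: float = 0.5) -> tuple[float, float]:
    """Hypothetical sketch: draw one (t_rgb, t_depth) pair in [0, 1].

    With probability p_clean, one branch is pinned to timestep 0 (clean)
    while the other keeps a uniform timestep, so training also covers the
    conditional settings; otherwise both branches get independent uniform
    timesteps. All values here are illustrative, not the paper's.
    """
    t_rgb = random.random()
    t_depth = random.random()
    u = random.random()
    if u < p_clean / 2:
        t_rgb = 0.0      # clean RGB branch: the depth-estimation setting
    elif u < p_clean:
        t_depth = 0.0    # clean depth branch: depth-conditioned generation
    return t_rgb, t_depth
```

The design intent in this sketch is that the joint, RGB-conditional, and depth-conditional regimes all appear during training, which is what would let a single model serve all three tasks at inference.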