Diffusion Knows Transparency: Repurposing Video Diffusion for Transparent Object Depth and Normal Estimation

📅 2025-12-29
📈 Citations: 0 · Influential: 0
🤖 AI Summary
Transparent objects violate the assumptions behind conventional depth estimation: refraction, reflection, and transmission cause holes and temporally unstable estimates in monocular methods. To address this, we propose the first zero-shot, video-aware framework for transparent object reconstruction. Our method uncovers the implicit optical priors of transparency encoded in video diffusion transformers (DiTs) and introduces a lightweight LoRA-driven architecture for video-to-depth/normal translation. We further construct TransPhy3D, the first large-scale, physically realistic transparent video dataset, generated via OptiX-accelerated Blender/Cycles rendering. During training, our approach concatenates RGB and noisy depth latents in the DiT backbone, enabling robust spatiotemporal modeling. Evaluated on ClearPose, DREDS, and TransPhy3D-Test, it surpasses state-of-the-art methods with significantly improved temporal stability, and a 1.3B-parameter variant runs at ~0.17 s/frame. Integrated into a robotic grasping system, it substantially boosts manipulation success rates across transparent, reflective, and diffuse materials.

📝 Abstract
Transparent objects remain notoriously hard for perception systems: refraction, reflection and transmission break the assumptions behind stereo, ToF and purely discriminative monocular depth, causing holes and temporally unstable estimates. Our key observation is that modern video diffusion models already synthesize convincing transparent phenomena, suggesting they have internalized the optical rules. We build TransPhy3D, a synthetic video corpus of transparent/reflective scenes: 11k sequences rendered with Blender/Cycles. Scenes are assembled from a curated bank of category-rich static assets and shape-rich procedural assets paired with glass/plastic/metal materials. We render RGB + depth + normals with physically based ray tracing and OptiX denoising. Starting from a large video diffusion model, we learn a video-to-video translator for depth (and normals) via lightweight LoRA adapters. During training we concatenate RGB and (noisy) depth latents in the DiT backbone and co-train on TransPhy3D and existing frame-wise synthetic datasets, yielding temporally consistent predictions for arbitrary-length input videos. The resulting model, DKT, achieves zero-shot SOTA on real and synthetic video benchmarks involving transparency: ClearPose, DREDS (CatKnown/CatNovel), and TransPhy3D-Test. It improves accuracy and temporal consistency over strong image/video baselines, and a normal variant sets the best video normal estimation results on ClearPose. A compact 1.3B version runs at ~0.17 s/frame. Integrated into a grasping stack, DKT's depth boosts success rates across translucent, reflective and diffuse surfaces, outperforming prior estimators. Together, these results support a broader claim: "Diffusion knows transparency." Generative video priors can be repurposed, efficiently and label-free, into robust, temporally coherent perception for challenging real-world manipulation.
Problem

Research questions and friction points this paper is trying to address.

Estimating depth and normals for transparent objects in videos
Addressing refraction and reflection issues in perception systems
Achieving temporal consistency in transparent object depth estimation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Repurposes video diffusion models for depth estimation
Uses the synthetic TransPhy3D transparent video corpus for training (a rendering-setup sketch follows this list)
Employs lightweight LoRA adapters for video-to-video translation
Authors

Shaocong Xu · Xiamen University · open-set perception, vision-language perception, diffusion-based perception, machine learning
Songlin Wei · University of Southern California, (previously) Peking University · Robotics, 3D Vision
Qizhe Wei · Beijing Academy of Artificial Intelligence
Zheng Geng · Beijing Academy of Artificial Intelligence
Hong Li · Beijing Academy of Artificial Intelligence, Beihang University
Licheng Shen · Tsinghua University
Qianpu Sun · Tsinghua University
Shu Han · Yeshiva University · Information Systems
Bin Ma · Tsinghua University
Bohan Li · Shanghai Jiao Tong University, European Institute of Innovation and Technology Ningbo
Chongjie Ye · The Chinese University of Hong Kong, Shenzhen · Computer Vision
Yuhang Zheng · NUS; TARS AI · Robotics, 3D Vision
Nan Wang · Beijing Academy of Artificial Intelligence
Saining Zhang · College of Computing and Data Science, Nanyang Technological University · Computer Vision
Hao Zhao · Tsinghua University