🤖 AI Summary
This work addresses the challenge that existing image denoising methods often conflate high-frequency textures with random noise, leading to either detail loss or residual noise. To resolve this, we propose a causal intervention–based orthogonal content–noise disentanglement framework that explicitly separates the generative mechanisms of image content and noise within a Vision Transformer architecture. By integrating environment bias correction, dual-branch orthogonality constraints, and causal priors guided by an external generative model (Nano Banana Pro), our approach effectively eliminates spurious correlations between content and noise. Extensive experiments demonstrate that the method outperforms state-of-the-art algorithms across multiple benchmarks, achieving both high-fidelity reconstruction and efficient inference—reaching 104.2 FPS on a single RTX 5090 GPU.
📝 Abstract
Conventional image denoising models often inadvertently learn spurious correlations between environmental factors and noise patterns. Moreover, due to high-frequency ambiguity, they struggle to reliably distinguish subtle textures from stochastic noise, resulting in either over-smoothed details or residual noise artifacts. We therefore revisit denoising through the lens of causal intervention, arguing that purely correlational fitting entangles intrinsic content with extrinsic noise, which directly degrades robustness under distribution shifts. Motivated by this, we propose the Teacher-Guided Causal Disentanglement Network (TCD-Net), which explicitly decomposes the generative mechanism via structured interventions on feature spaces within a Vision Transformer framework. Specifically, our method integrates three key components: (1) an Environmental Bias Adjustment (EBA) module that projects features into a stable, de-centered subspace to suppress global environmental bias (de-confounding); (2) a dual-branch disentanglement head that employs an orthogonality constraint to enforce strict separation between content and noise representations, preventing information leakage; and (3) a causal prior guided by Nano Banana Pro, Google's reasoning-guided AI image generation model, which resolves structural ambiguity by pulling content representations back onto the natural-image manifold. Extensive experiments demonstrate that TCD-Net outperforms mainstream methods across multiple benchmarks in both fidelity and efficiency, achieving a real-time speed of 104.2 FPS on a single RTX 5090 GPU.
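The two architectural constraints named above can be illustrated with a minimal sketch. The abstract does not give concrete formulations, so the following is an assumption: EBA-style de-confounding is approximated here as mean-centering of batch features, and the dual-branch orthogonality constraint as a penalty on the cosine similarity between paired content and noise vectors. Function names (`debias`, `orthogonality_loss`) are hypothetical, not from the paper.

```python
import numpy as np

def debias(features):
    """Hypothetical EBA-style projection: subtract the batch mean so
    features live in a de-centered subspace, suppressing a shared
    (global environmental) bias component. features: (batch, dim)."""
    return features - features.mean(axis=0, keepdims=True)

def orthogonality_loss(content, noise, eps=1e-8):
    """Hypothetical dual-branch orthogonality penalty: mean squared
    cosine similarity between paired content/noise vectors. It is
    zero exactly when every content vector is orthogonal to its
    corresponding noise vector. content, noise: (batch, dim)."""
    c = content / (np.linalg.norm(content, axis=1, keepdims=True) + eps)
    n = noise / (np.linalg.norm(noise, axis=1, keepdims=True) + eps)
    cos = np.sum(c * n, axis=1)  # per-sample cosine similarity
    return float(np.mean(cos ** 2))

# Orthogonal content/noise pair incurs no penalty; a fully
# overlapping pair is maximally penalized.
content = np.array([[1.0, 0.0]])
print(orthogonality_loss(content, np.array([[0.0, 1.0]])))  # ~0.0
print(orthogonality_loss(content, content))                  # ~1.0
```

In a full model these terms would be added to the reconstruction objective with weighting coefficients; the sketch only shows the shape of the constraints, not the training loop.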