🤖 AI Summary
In colonoscopy depth estimation, acquiring accurate ground-truth depth is challenging, and existing sim-to-real image translation methods often suffer from structural distortion and unrealistic texture. To address these issues, this paper proposes a lighting-aware, structure-constrained controllable image translation method. It introduces Per-Pixel Shading (PPS) maps, derived from illumination modeling, as geometric priors in ControlNet conditioning, offering more robust structural guidance than conventional depth maps. Built on Stable Diffusion, the resulting end-to-end differentiable sim-to-real translation framework jointly optimizes structural fidelity and textural realism, and significantly outperforms MI-CycleGAN on colonoscopy depth estimation: translated images show higher structural consistency and more photorealistic texture, and depth prediction error drops by 18.7%. Code is publicly available.
📝 Abstract
Accurate depth estimation enhances endoscopy navigation and diagnostics, but obtaining ground-truth depth in clinical settings is challenging. Synthetic datasets are often used for training, yet the domain gap limits generalization to real data. We propose a novel image-to-image translation framework that preserves structure while generating realistic textures from clinical data. Our key innovation integrates Stable Diffusion with ControlNet, conditioned on a latent representation extracted from a Per-Pixel Shading (PPS) map. PPS captures surface lighting effects, providing a stronger structural constraint than depth maps. Experiments show our approach produces more realistic translations and improves depth estimation over GAN-based MI-CycleGAN. Our code is publicly accessible at https://github.com/anaxqx/PPS-Ctrl.
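To make the PPS conditioning signal concrete, here is a minimal sketch of computing a per-pixel shading map from a depth map. It assumes a Lambertian surface and a point light co-located with the camera with inverse-square falloff (a common illumination model for endoscopy); the paper's exact formulation, intrinsics, and normalization may differ, and all parameter names here are illustrative.

```python
import numpy as np

def pps_map(depth, fx=500.0, fy=500.0, cx=None, cy=None):
    """Per-Pixel Shading (PPS) sketch: Lambertian shading under a point
    light at the camera origin, with inverse-square attenuation.

    depth : (H, W) array of camera-frame Z values (arbitrary units).
    fx, fy, cx, cy : hypothetical pinhole intrinsics (illustrative).
    """
    h, w = depth.shape
    cx = (w - 1) / 2 if cx is None else cx
    cy = (h - 1) / 2 if cy is None else cy

    # Back-project each pixel to a 3D point in camera coordinates.
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    pts = np.stack([x, y, depth], axis=-1)            # (H, W, 3)

    # Surface normals from finite differences of the point cloud;
    # cross(dy, dx) orients normals toward the camera (-Z).
    dx = np.gradient(pts, axis=1)
    dy = np.gradient(pts, axis=0)
    n = np.cross(dy, dx)
    n /= np.linalg.norm(n, axis=-1, keepdims=True) + 1e-8

    # Light direction: from surface point back toward the origin.
    r = np.linalg.norm(pts, axis=-1, keepdims=True) + 1e-8
    l = -pts / r

    # Lambertian cosine term with 1/r^2 attenuation, normalized to [0, 1].
    shading = np.clip((n * l).sum(-1), 0.0, None) / (r[..., 0] ** 2)
    return shading / (shading.max() + 1e-8)
```

Because the shading term couples surface normals, viewing distance, and light falloff, a map like this encodes finer geometric cues than raw depth alone, which is the intuition behind using it as the ControlNet condition.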