🤖 AI Summary
This work addresses a central difficulty in modality translation: the solution space is highly under-constrained, so existing methods typically require fully paired data, which limits real-world applicability. To overcome this limitation, the authors propose a Structured Diffusion Bridge framework that turns paired supervision from a strict requirement into an optional alignment constraint. By incorporating structural inductive biases, the method models cross-modal mappings across unpaired, semi-paired, and fully paired settings. Experiments on both synthetic and real-world benchmarks show that, even with substantially reduced reliance on paired data, the approach achieves generation quality approaching that of fully supervised methods, which improves the practicality and generalization of modality translation systems.
📝 Abstract
Modality translation is inherently under-constrained, as multiple cross-modal mappings may yield the same marginals. Recent work has shown that diffusion bridges are effective for this task. However, most existing approaches rely on fully paired datasets, thereby imposing a single data-driven constraint. We propose a diffusion-bridge framework that characterizes the space of admissible solutions and restricts it via alignment constraints, treating paired supervision as an optional heuristic rather than a prerequisite. We validate our method on synthetic and real modality translation benchmarks across unpaired, semi-paired, and paired regimes, showing consistent performance across supervision levels. Notably, it achieves near fully-paired quality with substantially relaxed pairing requirements and remains applicable in the unpaired regime. These results highlight diffusion bridges as a flexible foundation for modality translation beyond fully paired data.
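The claim that several cross-modal mappings can share the same marginals can be illustrated with a toy example. The sketch below is not the authors' method. It assumes a binary source and a binary target with hand-picked probabilities, and all names (`identity_coupling`, `swap_coupling`, `marginals`) are hypothetical. It shows two different joint distributions whose marginals agree, and how a single paired observation distinguishes them.

```python
import numpy as np

# Two toy couplings over a binary source X (rows) and binary target Y (columns).
# The values are illustrative assumptions, not taken from the paper.
identity_coupling = np.array([[0.5, 0.0],
                              [0.0, 0.5]])  # x=0 maps to y=0, x=1 maps to y=1
swap_coupling = np.array([[0.0, 0.5],
                          [0.5, 0.0]])      # x=0 maps to y=1, x=1 maps to y=0


def marginals(joint):
    """Return (p_x, p_y) for a joint table with rows = x and columns = y."""
    return joint.sum(axis=1), joint.sum(axis=0)


px_a, py_a = marginals(identity_coupling)
px_b, py_b = marginals(swap_coupling)

# Both couplings induce the same marginals over X and over Y ...
same_marginals = bool(np.allclose(px_a, px_b) and np.allclose(py_a, py_b))
# ... yet they are different mappings between the two modalities.
same_coupling = bool(np.allclose(identity_coupling, swap_coupling))

# One paired observation (x=0, y=0) has nonzero probability under only one of them.
pair = (0, 0)
supports_identity = bool(identity_coupling[pair] > 0)
supports_swap = bool(swap_coupling[pair] > 0)

print(same_marginals, same_coupling, supports_identity, supports_swap)
```

In this toy setting, unpaired data alone cannot separate the two couplings, whereas even one pair rules out the swapped mapping. This matches the abstract's description of pairing as one possible constraint on an otherwise under-constrained solution space.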