🤖 AI Summary
This work addresses a challenge in RGB-T semantic segmentation: strictly aligned visible-infrared-label triplets are scarce, and existing generation methods cannot reliably ensure cross-modal spatial alignment and semantic consistency. To overcome this, we propose UniTriGen, a unified triplet generation framework that jointly models all three modalities with a diffusion model in a shared latent space, augmented with lightweight modality-specific residual adapters that accommodate each modality's imaging characteristics and output format. Generation is guided by text prompts, and a scene-balanced, class-aware few-shot sampling strategy is introduced to reduce generation bias and increase the scene and class diversity of the generated triplets. Using only a small amount of real paired data, UniTriGen produces high-quality, spatially aligned, semantically consistent, and modality-complementary triplets, yielding consistent performance improvements across various RGB-T segmentation models.
📝 Abstract
RGB-T semantic segmentation requires strictly aligned visible-infrared-label (VIS-IR-Label) triplets; however, such aligned triplet data are often scarce in real-world scenarios. Existing generative augmentation methods usually adopt cascaded generation paradigms, decomposing joint triplet generation into local conditional processes. As a result, consistency among VIS, IR, and Label in spatial structure, semantic content, and cross-modal details cannot be reliably maintained. To address this issue, we propose UniTriGen, a unified triplet generation framework that directly generates spatially aligned, semantically consistent, and modality-complementary VIS-IR-Label triplets under the guidance of text prompts. UniTriGen first introduces a unified triplet generation mechanism, in which VIS, IR, and Label are jointly encoded into a shared latent space and modeled with a diffusion process to enforce global cross-modal consistency. Lightweight modality-specific residual adapters are further integrated into this mechanism to accommodate modality-specific imaging characteristics and output formats. To mitigate generation bias caused by imbalanced scene and class distributions in limited paired triplets, UniTriGen also employs a scene-balanced and class-aware few-shot sampling strategy, which induces a more balanced sampling distribution and enhances the scene and class diversity of generated triplets. Experiments show that UniTriGen generates high-quality aligned triplets from limited real paired data, thereby achieving consistent performance improvements across various RGB-T semantic segmentation models.
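The idea behind scene-balanced, class-aware sampling can be illustrated with a small sketch. The abstract does not give the actual weighting rule, so everything here is an assumption made for illustration: the `scene` and `classes` fields, the function names, and the weight itself (inverse scene frequency multiplied by the mean inverse frequency of the classes a triplet contains). It is not the authors' implementation; it only shows how such a scheme shifts probability toward under-represented scenes and classes.

```python
import random
from collections import Counter


def balanced_sampling_weights(samples):
    """Return one normalized sampling weight per triplet.

    Hypothetical input format: each sample is a dict with a 'scene' string
    and a 'classes' set naming the label classes present in the triplet.
    """
    scene_counts = Counter(s["scene"] for s in samples)
    class_counts = Counter(c for s in samples for c in s["classes"])

    weights = []
    for s in samples:
        # Scene balance: triplets from rare scenes get a larger weight.
        scene_w = 1.0 / scene_counts[s["scene"]]
        # Class awareness: mean inverse frequency of the classes present.
        if s["classes"]:
            class_w = sum(1.0 / class_counts[c] for c in s["classes"]) / len(s["classes"])
        else:
            class_w = 1.0 / len(samples)  # assumed fallback for unlabeled samples
        weights.append(scene_w * class_w)

    total = sum(weights)
    return [w / total for w in weights]


def draw_few_shot(samples, k, seed=0):
    """Draw k triplets (with replacement) using the balanced weights."""
    rng = random.Random(seed)
    return rng.choices(samples, weights=balanced_sampling_weights(samples), k=k)


# Toy data: three daytime triplets and one nighttime triplet.
triplets = [
    {"id": "a", "scene": "day", "classes": {"car", "road"}},
    {"id": "b", "scene": "day", "classes": {"car", "road"}},
    {"id": "c", "scene": "day", "classes": {"road", "person"}},
    {"id": "d", "scene": "night", "classes": {"road", "bike"}},
]
weights = balanced_sampling_weights(triplets)
batch = draw_few_shot(triplets, k=8)
```

In this toy set the single nighttime triplet receives the largest weight, and the daytime triplet containing the rare `person` class outweighs the two that contain only common classes. Uniform sampling would instead pick a daytime triplet three times out of four.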